Performance degradation in commit ac1d794

Started by Васильев Дмитрий about 10 years ago, 94 messages
#1 Васильев Дмитрий
d.vasilyev@postgrespro.ru

Hello hackers!

I suddenly found that commit ac1d794 causes up to a 3x performance degradation.

I ran pgbench -s 1000 -j 48 -c 48 -S -M prepared on a 70-core machine:
commit ac1d794 gives me 363,474 tps,
the previous commit a05dc4d gives me 956,146,
and master (3d0c50f) with ac1d794 reverted gives me 969,265.

It's shocking.

---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

#2 Andres Freund
andres@anarazel.de
In reply to: Васильев Дмитрий (#1)
Re: Performance degradation in commit ac1d794

On December 25, 2015 6:08:15 PM GMT+01:00, "Васильев Дмитрий" <d.vasilyev@postgrespro.ru> wrote:

Hello hackers!

I suddenly found commit ac1d794 gives up to 3 times performance
degradation.

I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70
CPU-core
machine:
commit ac1d794 gives me 363,474 tps
and previous commit a05dc4d gives me 956,146
and master( 3d0c50f ) with revert ac1d794 gives me 969,265

it's shocking

Hi,

You're talking about http://git.postgresql.org/gitweb/?p=postgresql.git;a=blobdiff;f=src/backend/libpq/be-secure.c;h=2ddcf428f89fd12c230d6f417c2f707fbd97bf39;hp=26d8faaf773a818b388b899b8d83d617bdf7af9b;hb=ac1d794;hpb=a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e

If so, could you provide a hierarchical before/after profile?

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.


#3 Васильев Дмитрий
d.vasilyev@postgrespro.ru
In reply to: Andres Freund (#2)
Re: Performance degradation in commit ac1d794

2015-12-25 20:18 GMT+03:00 Andres Freund <andres@anarazel.de>:

On December 25, 2015 6:08:15 PM GMT+01:00, "Васильев Дмитрий" <
d.vasilyev@postgrespro.ru> wrote:

Hello hackers!

I suddenly found commit ac1d794 gives up to 3 times performance
degradation.

I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70
CPU-core
machine:
commit ac1d794 gives me 363,474 tps
and previous commit a05dc4d gives me 956,146
and master( 3d0c50f ) with revert ac1d794 gives me 969,265

it's shocking


You're talking about
http://git.postgresql.org/gitweb/?p=postgresql.git;a=blobdiff;f=src/backend/libpq/be-secure.c;h=2ddcf428f89fd12c230d6f417c2f707fbd97bf39;hp=26d8faaf773a818b388b899b8d83d617bdf7af9b;hb=ac1d794;hpb=a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e

​​
If so, could you provide a hierarchical before/after profile?

Andres
Hi,

---
Please excuse brevity and formatting - I am writing this on my mobile
phone.

​> ​You're talking about
http://git.postgresql.org/gitweb/?p=postgresql.git;a=blobdiff;f=src/backend/libpq/be-secure.c;h=2ddcf428f89fd12c230d6f417c2f707fbd97bf39;hp=26d8faaf773a818b388b899b8d83d617bdf7af9b;hb=ac1d794;hpb=a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e

Yes, about this.

​If so, could you provide a hierarchical before/after profile?

​Performance | Git hash commit | Date
~ 360k tps | c3e7c24a1d60dc6ad56e2a0723399f1570c54224 | Thu Nov 12 09:12:18 2015 -0500
~ 360k tps | ac1d7945f866b1928c2554c0f80fd52d7f977772 | Thu Nov 12 09:00:33 2015 -0500
~ 960k tps | a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e | Thu Nov 12 07:40:31 2015 -0500

​​---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company​

#4 Васильев Дмитрий
d.vasilyev@postgrespro.ru
In reply to: Васильев Дмитрий (#3)
Re: Performance degradation in commit ac1d794

2015-12-25 20:27 GMT+03:00 Васильев Дмитрий <d.vasilyev@postgrespro.ru>:

2015-12-25 20:18 GMT+03:00 Andres Freund <andres@anarazel.de>:

On December 25, 2015 6:08:15 PM GMT+01:00, "Васильев Дмитрий" <
d.vasilyev@postgrespro.ru> wrote:

Hello hackers!

I suddenly found commit ac1d794 gives up to 3 times performance
degradation.

I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70
CPU-core
machine:
commit ac1d794 gives me 363,474 tps
and previous commit a05dc4d gives me 956,146
and master( 3d0c50f ) with revert ac1d794 gives me 969,265

it's shocking


You're talking about
http://git.postgresql.org/gitweb/?p=postgresql.git;a=blobdiff;f=src/backend/libpq/be-secure.c;h=2ddcf428f89fd12c230d6f417c2f707fbd97bf39;hp=26d8faaf773a818b388b899b8d83d617bdf7af9b;hb=ac1d794;hpb=a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e

​​
If so, could you provide a hierarchical before/after profile?

Andres
Hi,

---
Please excuse brevity and formatting - I am writing this on my mobile
phone.

​> ​You're talking about
http://git.postgresql.org/gitweb/?p=postgresql.git;a=blobdiff;f=src/backend/libpq/be-secure.c;h=2ddcf428f89fd12c230d6f417c2f707fbd97bf39;hp=26d8faaf773a818b388b899b8d83d617bdf7af9b;hb=ac1d794;hpb=a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e

Yes, about this.

​If so, could you provide a hierarchical before/after profile?

​Performance | Git hash commit | Date
~ 360k tps | c3e7c24a1d60dc6ad56e2a0723399f1570c54224 | Thu Nov 12 09:12:18 2015 -0500
~ 360k tps | ac1d7945f866b1928c2554c0f80fd52d7f977772 | Thu Nov 12 09:00:33 2015 -0500
~ 960k tps | a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e | Thu Nov 12 07:40:31 2015 -0500

​​
​​---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

I came across this while studying the following results:

vanilla 9.5 = 30020c3fc3b6de5592978313df929d370f5770ce
vanilla 9.6 = c4a8812cf64b142685e39a69694c5276601f40e4


info | clients | tps
-----------------------+---------+---------
vanilla 9.5 | 1 | 30321
vanilla 9.5 | 8 | 216542
vanilla 9.5 | 16 | 412526
vanilla 9.5 | 32 | 751331
vanilla 9.5 | 48 | 956146 <- this point
vanilla 9.5 | 56 | 990122
vanilla 9.5 | 64 | 842436
vanilla 9.5 | 72 | 913272
vanilla 9.5 | 82 | 659332
vanilla 9.5 | 92 | 630111
vanilla 9.5 | 96 | 616863
vanilla 9.5 | 110 | 592080
vanilla 9.5 | 120 | 575831
vanilla 9.5 | 130 | 557521
vanilla 9.5 | 140 | 537951
vanilla 9.5 | 150 | 517019
vanilla 9.5 | 160 | 502312
vanilla 9.5 | 170 | 489162
vanilla 9.5 | 180 | 477178
vanilla 9.5 | 190 | 464620
vanilla 9.6 | 1 | 31738
vanilla 9.6 | 8 | 219692
vanilla 9.6 | 16 | 422933
vanilla 9.6 | 32 | 375546
vanilla 9.6 | 48 | 363474 <- this point
vanilla 9.6 | 56 | 352943
vanilla 9.6 | 64 | 334498
vanilla 9.6 | 72 | 369802
vanilla 9.6 | 82 | 604867
vanilla 9.6 | 92 | 871048
vanilla 9.6 | 96 | 969265
vanilla 9.6 | 105 | 996794
vanilla 9.6 | 110 | 932853
vanilla 9.6 | 115 | 758485
vanilla 9.6 | 120 | 721365
vanilla 9.6 | 125 | 632265
vanilla 9.6 | 130 | 624666
vanilla 9.6 | 135 | 582120
vanilla 9.6 | 140 | 583080
vanilla 9.6 | 150 | 555608
vanilla 9.6 | 160 | 533340
vanilla 9.6 | 170 | 520308
vanilla 9.6 | 180 | 504536
vanilla 9.6 | 190 | 496967​

​​​​---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company​​

#5 Andres Freund
andres@anarazel.de
In reply to: Васильев Дмитрий (#3)
Re: Performance degradation in commit ac1d794

On December 25, 2015 6:27:06 PM GMT+01:00, "Васильев Дмитрий" <d.vasilyev@postgrespro.ru> wrote:

​If so, could you provide a hierarchical before/after profile?

​Performance | Git hash commit | Date
~ 360k tps | c3e7c24a1d60dc6ad56e2a0723399f1570c54224 | Thu Nov 12 09:12:18 2015 -0500
~ 360k tps | ac1d7945f866b1928c2554c0f80fd52d7f977772 | Thu Nov 12 09:00:33 2015 -0500
~ 960k tps | a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e | Thu Nov 12 07:40:31 2015 -0500

Profile as in perf, oprofile, or something similar.

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.


#6 Васильев Дмитрий
d.vasilyev@postgrespro.ru
In reply to: Andres Freund (#5)
Re: Performance degradation in commit ac1d794


2015-12-25 20:44 GMT+03:00 Andres Freund <andres@anarazel.de>:

On December 25, 2015 6:27:06 PM GMT+01:00, "Васильев Дмитрий" <d.vasilyev@postgrespro.ru> wrote:

​If so, could you provide a hierarchical before/after profile?

​Performance | Git hash commit | Date
~ 360k tps | c3e7c24a1d60dc6ad56e2a0723399f1570c54224 | Thu Nov 12 09:12:18 2015 -0500
~ 360k tps | ac1d7945f866b1928c2554c0f80fd52d7f977772 | Thu Nov 12 09:00:33 2015 -0500
~ 960k tps | a05dc4d7fd57d4ae084c1f0801973e5c1a1aa26e | Thu Nov 12 07:40:31 2015 -0500

Profile as in perf oprofile or something.

---
Please excuse brevity and formatting - I am writing this on my mobile phone.

ac1d794:

 Samples: 1M of event 'cycles', Event count (approx.): 816922259995, UID: pgpro
 Overhead  Shared Object  Symbol
  69,72%   [kernel]       [k] _raw_spin_lock_irqsave
   1,43%   postgres       [.] _bt_compare
   1,19%   postgres       [.] LWLockAcquire
   0,99%   postgres       [.] hash_search_with_hash_value
   0,61%   postgres       [.] PinBuffer
   0,46%   postgres       [.] GetSnapshotData

a05dc4d:

 Samples: 1M of event 'cycles', Event count (approx.): 508150718694, UID: pgpro
 Overhead  Shared Object  Symbol
   4,77%   postgres       [.] GetSnapshotData
   4,30%   postgres       [.] _bt_compare
   3,13%   postgres       [.] hash_search_with_hash_value
   3,08%   postgres       [.] LWLockAcquire
   2,09%   postgres       [.] LWLockRelease
   2,03%   postgres       [.] PinBuffer

Perf record generates a lot of data:

time perf record -u pgpro -g --call-graph=dwarf
^C[ perf record: Woken up 0 times to write data ]
Warning:
Processed 1078453 events and lost 18257 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 8507.985 MB perf.data (1055663 samples) ]
real 0m8.791s
user 0m0.678s
sys 0m8.120s

If you want, I can give you SSH access.

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Васильев Дмитрий (#6)
Re: Performance degradation in commit ac1d794

Васильев Дмитрий <d.vasilyev@postgrespro.ru> writes:

 Samples: 1M of event 'cycles', Event count (approx.): 816922259995, UID: pgpro
Overhead Shared Object Symbol

69,72% [kernel] [k] _raw_spin_lock_irqsave
1,43% postgres [.] _bt_compare
1,19% postgres [.] LWLockAcquire
0,99% postgres [.] hash_search_with_hash_value
0,61% postgres [.] PinBuffer

Seems like what you've got here is a kernel bug.

regards, tom lane


#8 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#7)
Re: Performance degradation in commit ac1d794

On December 25, 2015 7:10:23 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Васильев Дмитрий <d.vasilyev@postgrespro.ru> writes:

Samples: 1M of event 'cycles', Event count (approx.): 816922259995, UID: pgpro
Overhead Shared Object Symbol

69,72% [kernel] [k] _raw_spin_lock_irqsave
1,43% postgres [.] _bt_compare
1,19% postgres [.] LWLockAcquire
0,99% postgres [.] hash_search_with_hash_value
0,61% postgres [.] PinBuffer

Seems like what you've got here is a kernel bug.

I wouldn't go as far as calling it a kernel bug. We're still doing 300k tps. And we're triggering the performance degradation by adding another socket (IIRC) to the poll(2) call.

It would certainly be interesting to see the expanded tree below the spinlock. I wonder if this is related to directed wakeups.

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.


#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#8)
Re: Performance degradation in commit ac1d794

Andres Freund <andres@anarazel.de> writes:

On December 25, 2015 7:10:23 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Seems like what you've got here is a kernel bug.

I wouldn't go as far as calling it a kernel bug. Were still doing 300k tps. And were triggering the performance degradation by adding another socket (IIRC) to the poll(2) call.

Hmm. And all those FDs point to the same pipe. I wonder if we're looking
at contention for some pipe-related data structure inside the kernel.

regards, tom lane


#10 Васильев Дмитрий
d.vasilyev@postgrespro.ru
In reply to: Tom Lane (#9)
Re: Performance degradation in commit ac1d794


2015-12-25 21:28 GMT+03:00 Tom Lane <tgl@sss.pgh.pa.us>:

Andres Freund <andres@anarazel.de> writes:

On December 25, 2015 7:10:23 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us>

wrote:

Seems like what you've got here is a kernel bug.

I wouldn't go as far as calling it a kernel bug. Were still doing 300k

tps. And were triggering the performance degradation by adding another
socket (IIRC) to the poll(2) call.

Hmm. And all those FDs point to the same pipe. I wonder if we're looking
at contention for some pipe-related data structure inside the kernel.

regards, tom lane

I ran a backtrace (bt) on the backends and found them in the following state:

#0 0x00007f77b0e5bb60 in __poll_nocancel () from /lib64/libc.so.6
#1 0x00000000006a7cd0 in WaitLatchOrSocket (latch=0x7f779e2e96c4, wakeEvents=wakeEvents@entry=19, sock=9, timeout=timeout@entry=0) at pg_latch.c:333
#2 0x0000000000612c7d in secure_read (port=0x17e6af0, ptr=0xcc94a0 <PqRecvBuffer>, len=8192) at be-secure.c:147
#3 0x000000000061be36 in pq_recvbuf () at pqcomm.c:915
#4 pq_getbyte () at pqcomm.c:958
#5 0x0000000000728ad5 in SocketBackend (inBuf=0x7ffd8b6b1460) at postgres.c:345

Perf shows _raw_spin_lock_irqsave call remove_wait_queue add_wait_queue
Here are screenshots: http://i.imgur.com/pux2bGJ.png and http://i.imgur.com/LJQbm2V.png

​​---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company​
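For reference, the wait in frame #1 above (wakeEvents=19, i.e. WL_LATCH_SET | WL_SOCKET_READABLE | WL_POSTMASTER_DEATH) corresponds roughly to the sketch below. The wrapper function name is invented, but the flags and the WaitLatchOrSocket() call mirror what ac1d794 added to secure_read(): with the poll(2)-based latch implementation, every such wait re-registers the shared postmaster-alive pipe fd in the kernel, which is where the wait-queue spinlock contention comes from.

#include "postgres.h"

#include "libpq/libpq-be.h"
#include "miscadmin.h"
#include "storage/latch.h"

/* Rough sketch only, not the actual be-secure.c code. */
static int
wait_for_client_input(Port *port)
{
    int         w;

    /*
     * Block until the client socket is readable, the latch is set, or the
     * postmaster dies.  wakeEvents = 19 in the backtrace above is exactly
     * this flag combination.
     */
    w = WaitLatchOrSocket(MyLatch,
                          WL_LATCH_SET | WL_SOCKET_READABLE | WL_POSTMASTER_DEATH,
                          port->sock,
                          0);       /* no timeout */

    if (w & WL_POSTMASTER_DEATH)
        ereport(FATAL,
                (errcode(ERRCODE_ADMIN_SHUTDOWN),
                 errmsg("terminating connection due to unexpected postmaster exit")));

    return w;
}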

#11 Васильев Дмитрий
d.vasilyev@postgrespro.ru
In reply to: Васильев Дмитрий (#10)
Re: Performance degradation in commit ac1d794


2015-12-25 22:42 GMT+03:00 Васильев Дмитрий <d.vasilyev@postgrespro.ru>:


2015-12-25 21:28 GMT+03:00 Tom Lane <tgl@sss.pgh.pa.us>:

Andres Freund <andres@anarazel.de> writes:

On December 25, 2015 7:10:23 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us>

wrote:

Seems like what you've got here is a kernel bug.

I wouldn't go as far as calling it a kernel bug. Were still doing 300k

tps. And were triggering the performance degradation by adding another
socket (IIRC) to the poll(2) call.

Hmm. And all those FDs point to the same pipe. I wonder if we're looking
at contention for some pipe-related data structure inside the kernel.

regards, tom lane

​I did bt on backends and found it in following state:

#0 0x00007f77b0e5bb60 in __poll_nocancel () from /lib64/libc.so.6
#1 0x00000000006a7cd0 in WaitLatchOrSocket (latch=0x7f779e2e96c4,
wakeEvents=wakeEvents@entry=19, sock=9, timeout=timeout@entry=0) at
pg_latch.c:333
#2 0x0000000000612c7d in secure_read (port=0x17e6af0, ptr=0xcc94a0
<PqRecvBuffer>, len=8192) at be-secure.c:147
#3 0x000000000061be36 in pq_recvbuf () at pqcomm.c:915
#4 pq_getbyte () at pqcomm.c:958
#5 0x0000000000728ad5 in SocketBackend (inBuf=0x7ffd8b6b1460) at
postgres.c:345

Perf shows _raw_spin_lock_irqsave call remove_wait_queue add_wait_queue
There’s screenshots: http://i.imgur.com/pux2bGJ.png
http://i.imgur.com/LJQbm2V.png​

​​---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

I'm sorry, I meant that the remove_wait_queue and add_wait_queue functions call
_raw_spin_lock_irqsave, which is what holds most of the processor time.

​uname -a: 3.10.0-229.20.1.el7.x86_64 #1 SMP Tue Nov 3 19:10:07 UTC 2015
x86_64 x86_64 x86_64 GNU/Linux​

​​---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company​

#12 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#9)
Re: Performance degradation in commit ac1d794

On 2015-12-25 13:28:55 -0500, Tom Lane wrote:

Hmm. And all those FDs point to the same pipe. I wonder if we're looking
at contention for some pipe-related data structure inside the kernel.

Sounds fairly likely - and not too surprising. In this scenario we have a
couple hundred thousand registrations/unregistrations of a pretty fundamentally
shared resource (the wait queue for changes to the pipe). It's not that
surprising that it becomes a problem.

There are a couple of solutions I can think of to that problem:
1) Use epoll()/kqueue, or other similar interfaces that don't require
re-registering fds at every invocation. My guess is that that'd be
desirable for performance anyway.

2) Create a pair of fds between postmaster/backend for each
backend. While obviously increasing the number of FDs noticeably,
it's interesting for other features as well: If we ever want to do FD
passing from postmaster to existing backends, we're going to need
that anyway.

3) Replace the postmaster_alive_fds socketpair by some other signalling
mechanism. E.g. sending a procsignal to each backend, which sets the
latch and a special flag in the latch structure.

Andres
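To make the difference behind option 1 concrete: with poll(2) the whole fd set is handed to the kernel on every call, so each backend adds itself to and removes itself from the postmaster-alive pipe's wait queue on every wait, while with epoll the registration is done once against a persistent kernel object and later waits reuse it. A minimal standalone sketch of that pattern (plain Linux API; the function names are illustrative, this is not PostgreSQL code):

#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>

/* Register the two fds once; the registration lives in the epoll object. */
static int
setup_wait(int client_sock, int postmaster_pipe_fd)
{
    struct epoll_event ev;
    int         epfd = epoll_create(2);    /* argument is only a size hint */

    if (epfd < 0)
    {
        perror("epoll_create");
        exit(1);
    }

    ev.events = EPOLLIN;
    ev.data.fd = client_sock;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, client_sock, &ev) < 0)
        perror("epoll_ctl(client socket)");

    ev.events = EPOLLIN | EPOLLHUP | EPOLLERR;
    ev.data.fd = postmaster_pipe_fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, postmaster_pipe_fd, &ev) < 0)
        perror("epoll_ctl(postmaster pipe)");

    return epfd;
}

/* Each wait is now a single syscall; no per-call re-registration. */
static int
wait_once(int epfd)
{
    struct epoll_event out;
    int         rc = epoll_wait(epfd, &out, 1, -1);

    return (rc > 0) ? out.data.fd : -1;    /* fd that became ready, or -1 */
}

This is essentially the shape the 0004 patch later in the thread gives to the latch code itself.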


#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#12)
Re: Performance degradation in commit ac1d794

Andres Freund <andres@anarazel.de> writes:

There's a couple solutions I can think of to that problem:
1) Use epoll()/kqueue, or other similar interfaces that don't require
re-registering fds at every invocation. My guess is that that'd be
desirable for performance anyway.

Portability, on the other hand, would be problematic.

2) Create a pair of fds between postmaster/backend for each
backend. While obviously increasing the the number of FDs noticeably,
it's interesting for other features as well: If we ever want to do FD
passing from postmaster to existing backends, we're going to need
that anyway.

Maybe; it'd provide another limit on how many backends we could run.

3) Replace the postmaster_alive_fds socketpair by some other signalling
mechanism. E.g. sending a procsignal to each backend, which sets the
latch and a special flag in the latch structure.

And what would send the signal? The entire point here is to notice the
situation where the postmaster has crashed. It can *not* depend on the
postmaster taking some action.

regards, tom lane


#14 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#13)
Re: Performance degradation in commit ac1d794

On 2015-12-25 16:29:53 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

There's a couple solutions I can think of to that problem:
1) Use epoll()/kqueue, or other similar interfaces that don't require
re-registering fds at every invocation. My guess is that that'd be
desirable for performance anyway.

Portability, on the other hand, would be problematic.

Indeed. But we might be able to get away with it because there's
realistically just one platform on which people run four-socket
servers. Obviously we'd leave poll and select support in place. It'd be
a genuine improvement for less extreme loads on Linux, too.

3) Replace the postmaster_alive_fds socketpair by some other signalling
mechanism. E.g. sending a procsignal to each backend, which sets the
latch and a special flag in the latch structure.

And what would send the signal? The entire point here is to notice the
situation where the postmaster has crashed. It can *not* depend on the
postmaster taking some action.

Ahem. Um. Look, over there --->

I blame it on all the food.

Andres


#15 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#14)
Re: Performance degradation in commit ac1d794

On 2015-12-26 12:22:48 +0100, Andres Freund wrote:

3) Replace the postmaster_alive_fds socketpair by some other signalling
mechanism. E.g. sending a procsignal to each backend, which sets the
latch and a special flag in the latch structure.

And what would send the signal? The entire point here is to notice the
situation where the postmaster has crashed. It can *not* depend on the
postmaster taking some action.

Ahem. Um. Look, over there --->

I blame it on all the food.

An unportable and easy version of this, actually making sense this time,
would be to use prctl(PR_SET_PDEATHSIG, SIGQUIT). That'd send SIGQUIT to
backends whenever the postmaster dies. Obviously that's not portable
either - doing this for Linux only wouldn't be all that kludgey, though.
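A minimal sketch of what that could look like in a freshly forked backend, assuming Linux; the function name is invented, and the getppid() recheck covers the window where the postmaster exited before prctl() took effect:

#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

static void
request_postmaster_death_signal(pid_t postmaster_pid)
{
    /* Ask the kernel to send us SIGQUIT when our parent exits. */
    if (prctl(PR_SET_PDEATHSIG, SIGQUIT) < 0)
        _exit(1);               /* real code would report the error properly */

    /* The parent may already have died before prctl() took effect. */
    if (getppid() != postmaster_pid)
        raise(SIGQUIT);
}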


#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#15)
Re: Performance degradation in commit ac1d794

Andres Freund <andres@anarazel.de> writes:

A unportable and easy version of this, actually making sense this time,
would be to use prctl(PR_SET_PDEATHSIG, SIGQUIT). That'd send SIGQUIT to
backends whenever postmaster dies. Obviously that's not portable
either - doing this for linux only wouldn't be all that kludgey tho.

Hmm. That would have semantics rather substantially different from
the way that the WL_POSTMASTER_DEATH code behaves. But I don't know
how much we care about that, since the whole scenario is something
that should not happen under normal circumstances. Maybe cross-platform
variation is OK as long as it doesn't make the code too hairy.

regards, tom lane


#17 Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#15)
Re: Performance degradation in commit ac1d794

On Sat, Dec 26, 2015 at 5:41 PM, Andres Freund <andres@anarazel.de> wrote:

On 2015-12-26 12:22:48 +0100, Andres Freund wrote:

3) Replace the postmaster_alive_fds socketpair by some other signalling
mechanism. E.g. sending a procsignal to each backend, which sets the
latch and a special flag in the latch structure.

And what would send the signal? The entire point here is to notice the
situation where the postmaster has crashed. It can *not* depend on the
postmaster taking some action.

Ahem. Um. Look, over there --->

I blame it on all the food.

A unportable and easy version of this, actually making sense this time,
would be to use prctl(PR_SET_PDEATHSIG, SIGQUIT). That'd send SIGQUIT to
backends whenever postmaster dies. Obviously that's not portable
either - doing this for linux only wouldn't be all that kludgey tho.

There is a way to make backends exit on Windows as well, by using
job objects and setting limitFlags to JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE
in JOBOBJECT_BASIC_LIMIT_INFORMATION.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
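A rough sketch of that Windows-side idea (illustrative only, not from any patch in this thread): the postmaster would create a job object with kill-on-close set and assign each backend process to it, so that when the last handle to the job goes away with the postmaster, every process in the job is terminated:

#include <windows.h>

static HANDLE
create_kill_on_close_job(void)
{
    JOBOBJECT_BASIC_LIMIT_INFORMATION limits;
    HANDLE      job = CreateJobObject(NULL, NULL);

    if (job == NULL)
        return NULL;

    ZeroMemory(&limits, sizeof(limits));
    limits.LimitFlags = JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE;

    if (!SetInformationJobObject(job, JobObjectBasicLimitInformation,
                                 &limits, sizeof(limits)))
    {
        CloseHandle(job);
        return NULL;
    }

    return job;
}

/* Each backend would then be attached with:
 *     AssignProcessToJobObject(job, backend_process_handle);
 */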

#18 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#14)
4 attachment(s)
Re: Performance degradation in commit ac1d794

On 2015-12-26 12:22:48 +0100, Andres Freund wrote:

On 2015-12-25 16:29:53 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

There's a couple solutions I can think of to that problem:
1) Use epoll()/kqueue, or other similar interfaces that don't require
re-registering fds at every invocation. My guess is that that'd be
desirable for performance anyway.

Portability, on the other hand, would be problematic.

Indeed. But we might be able to get away with it because there's
realistically just one platform on which people run four socket
servers. Obviously we'd leave poll and select support in place. It'd be
a genuine improvement for less extreme loads on linux, too.

I finally got back to working on this. Attached is a WIP patch series
implementing:
0001: Allow to easily choose between the readiness primitives in unix_latch.c
Pretty helpful for testing, not useful for anything else.
0002: Error out if waiting on socket readiness without a specified socket.
0003: Only clear unix_latch.c's self-pipe if it actually contains data.
~2% on high qps workloads
0004: Support using epoll as the polling primitive in unix_latch.c.
~3% on high qps workloads, massive scalability improvements (x3)
on very large machines.

0004 is obviously the relevant bit for this thread. I verified that using
epoll addresses the performance problem, on the hardware where the OP
originally noticed it.

The reason I went with epoll over the PR_SET_PDEATHSIG approach is
that it provides semantics more similar to the other platforms,
while being just as platform-dependent as PR_SET_PDEATHSIG. It is also
measurably faster, at least here.

0004 currently contains one debatable optimization, which I'd like to
discuss: currently the 'sock' passed to WaitLatchOrSocket is not
removed from / re-added to the epoll fd if it's numerically the same as in the
last call. That's good for performance, but would be wrong if the socket
were closed and a new one with the same value were waited on. I
think a big warning sign somewhere is sufficient to deal with that
problem - it's not something we're likely to start doing. And even if
it's done at some point, we can just offer an API to reset the last-used
socket fd.

Unless somebody comes up with a platform independent way of addressing
this, I'm inclined to press forward using epoll(). Opinions?

Andres

Attachments:

0001-Make-it-easier-to-choose-the-used-waiting-primitive-.patch (text/x-patch; charset=us-ascii)
>From fb67ecf2f6f65525af1ed7c5d5e5dd46e8fa6fc4 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 14 Jan 2016 14:17:43 +0100
Subject: [PATCH 1/4] Make it easier to choose the used waiting primitive in
 unix_latch.c.

---
 src/backend/port/unix_latch.c | 50 +++++++++++++++++++++++++++++--------------
 1 file changed, 34 insertions(+), 16 deletions(-)

diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
index 2ad609c..f52704b 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -56,6 +56,22 @@
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
 
+/*
+ * Select the fd readiness primitive to use. Normally the "most modern"
+ * primitive supported by the OS will be used, but for testing it can be
+ * useful to manually specify the used primitive.  If desired, just add a
+ * define somewhere before this block.
+ */
+#if defined(LATCH_USE_POLL) || defined(LATCH_USE_SELECT)
+/* don't overwrite manual choice */
+#elif defined(HAVE_POLL)
+#define LATCH_USE_POLL
+#elif HAVE_SYS_SELECT_H
+#define LATCH_USE_SELECT
+#else
+#error "no latch implementation available"
+#endif
+
 /* Are we currently in WaitLatch? The signal handler would like to know. */
 static volatile sig_atomic_t waiting = false;
 
@@ -215,10 +231,10 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 				cur_time;
 	long		cur_timeout;
 
-#ifdef HAVE_POLL
+#if defined(LATCH_USE_POLL)
 	struct pollfd pfds[3];
 	int			nfds;
-#else
+#elif defined(LATCH_USE_SELECT)
 	struct timeval tv,
 			   *tvp;
 	fd_set		input_mask;
@@ -247,7 +263,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 		Assert(timeout >= 0 && timeout <= INT_MAX);
 		cur_timeout = timeout;
 
-#ifndef HAVE_POLL
+#ifdef LATCH_USE_SELECT
 		tv.tv_sec = cur_timeout / 1000L;
 		tv.tv_usec = (cur_timeout % 1000L) * 1000L;
 		tvp = &tv;
@@ -257,7 +273,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	{
 		cur_timeout = -1;
 
-#ifndef HAVE_POLL
+#ifdef LATCH_USE_SELECT
 		tvp = NULL;
 #endif
 	}
@@ -291,16 +307,10 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 		}
 
 		/*
-		 * Must wait ... we use poll(2) if available, otherwise select(2).
-		 *
-		 * On at least older linux kernels select(), in violation of POSIX,
-		 * doesn't reliably return a socket as writable if closed - but we
-		 * rely on that. So far all the known cases of this problem are on
-		 * platforms that also provide a poll() implementation without that
-		 * bug.  If we find one where that's not the case, we'll need to add a
-		 * workaround.
+		 * Must wait ... we use the polling interface determined at the top of
+		 * this file to do so.
 		 */
-#ifdef HAVE_POLL
+#if defined(LATCH_USE_POLL)
 		nfds = 0;
 		if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
 		{
@@ -396,8 +406,16 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 					result |= WL_POSTMASTER_DEATH;
 			}
 		}
-#else							/* !HAVE_POLL */
+#elif defined(LATCH_USE_SELECT)
 
+		/*
+		 * On at least older linux kernels select(), in violation of POSIX,
+		 * doesn't reliably return a socket as writable if closed - but we
+		 * rely on that. So far all the known cases of this problem are on
+		 * platforms that also provide a poll() implementation without that
+		 * bug.  If we find one where that's not the case, we'll need to add a
+		 * workaround.
+		 */
 		FD_ZERO(&input_mask);
 		FD_ZERO(&output_mask);
 
@@ -477,7 +495,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 					result |= WL_POSTMASTER_DEATH;
 			}
 		}
-#endif   /* HAVE_POLL */
+#endif   /* LATCH_USE_SELECT */
 
 		/* If we're not done, update cur_timeout for next iteration */
 		if (result == 0 && (wakeEvents & WL_TIMEOUT))
@@ -490,7 +508,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 				/* Timeout has expired, no need to continue looping */
 				result |= WL_TIMEOUT;
 			}
-#ifndef HAVE_POLL
+#ifdef LATCH_USE_SELECT
 			else
 			{
 				tv.tv_sec = cur_timeout / 1000L;
-- 
2.5.0.400.gff86faf.dirty

0002-Error-out-if-waiting-on-socket-readiness-without-a-s.patch (text/x-patch; charset=us-ascii)
>From cd5a66b55a00ba70613cfbe45be758a64d2112f8 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 14 Jan 2016 14:24:09 +0100
Subject: [PATCH 2/4] Error out if waiting on socket readiness without a
 specified socket.

---
 src/backend/port/unix_latch.c  | 7 ++++---
 src/backend/port/win32_latch.c | 6 ++++--
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
index f52704b..ad621ea 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -242,9 +242,10 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			hifd;
 #endif
 
-	/* Ignore WL_SOCKET_* events if no valid socket is given */
-	if (sock == PGINVALID_SOCKET)
-		wakeEvents &= ~(WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);
+	/* waiting for socket readiness without a socket indicates a bug */
+	if (sock == PGINVALID_SOCKET &&
+		(wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+		elog(ERROR, "cannot wait on socket events without a socket");
 
 	Assert(wakeEvents != 0);	/* must have at least one wake event */
 
diff --git a/src/backend/port/win32_latch.c b/src/backend/port/win32_latch.c
index 80adc13..e101acf 100644
--- a/src/backend/port/win32_latch.c
+++ b/src/backend/port/win32_latch.c
@@ -119,8 +119,10 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 
 	Assert(wakeEvents != 0);	/* must have at least one wake event */
 
-	if ((wakeEvents & WL_LATCH_SET) && latch->owner_pid != MyProcPid)
-		elog(ERROR, "cannot wait on a latch owned by another process");
+	/* waiting for socket readiness without a socket indicates a bug */
+	if (sock == PGINVALID_SOCKET &&
+		(wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+		elog(ERROR, "cannot wait on socket events without a socket");
 
 	/*
 	 * Initialize timeout if requested.  We must record the current time so
-- 
2.5.0.400.gff86faf.dirty

0003-Only-clear-unix_latch.c-s-self-pipe-if-it-actually-c.patch (text/x-patch; charset=us-ascii)
>From 162f66f7fccc335d8caad7bb15be1c2030ec838e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 14 Jan 2016 15:15:17 +0100
Subject: [PATCH 3/4] Only clear unix_latch.c's self-pipe if it actually
 contains data.

This avoids a good number of, individually quite fast, system calls in
scenarios with many quick queries. Besides the aesthetic benefit of
seing fewer superflous system calls with strace, it also improves
performance by ~2% measured by pgbench -M prepared -c 96 -j 8 -S (scale
100).
---
 src/backend/port/unix_latch.c | 77 ++++++++++++++++++++++++++++---------------
 1 file changed, 51 insertions(+), 26 deletions(-)

diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
index ad621ea..03bca68 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -283,27 +283,27 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	do
 	{
 		/*
-		 * Clear the pipe, then check if the latch is set already. If someone
-		 * sets the latch between this and the poll()/select() below, the
-		 * setter will write a byte to the pipe (or signal us and the signal
-		 * handler will do that), and the poll()/select() will return
-		 * immediately.
+		 * Check if the latch is set already. If so, leave loop immediately,
+		 * avoid blocking again. We don't attempt to report any other events
+		 * that might also be satisfied.
+		 *
+		 * If someone sets the latch between this and the poll()/select()
+		 * below, the setter will write a byte to the pipe (or signal us and
+		 * the signal handler will do that), and the poll()/select() will
+		 * return immediately.
+		 *
+		 * If there's a pending byte in the self pipe, we'll notice whenever
+		 * blocking. Only clearing the pipe in that case avoids having to
+		 * drain it everytime WaitLatchOrSocket() is used.
 		 *
 		 * Note: we assume that the kernel calls involved in drainSelfPipe()
 		 * and SetLatch() will provide adequate synchronization on machines
 		 * with weak memory ordering, so that we cannot miss seeing is_set if
 		 * the signal byte is already in the pipe when we drain it.
 		 */
-		drainSelfPipe();
-
 		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
 		{
 			result |= WL_LATCH_SET;
-
-			/*
-			 * Leave loop immediately, avoid blocking again. We don't attempt
-			 * to report any other events that might also be satisfied.
-			 */
 			break;
 		}
 
@@ -313,24 +313,26 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 		 */
 #if defined(LATCH_USE_POLL)
 		nfds = 0;
+
+		/* selfpipe is always in pfds[0] */
+		pfds[0].fd = selfpipe_readfd;
+		pfds[0].events = POLLIN;
+		pfds[0].revents = 0;
+		nfds++;
+
 		if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
 		{
-			/* socket, if used, is always in pfds[0] */
-			pfds[0].fd = sock;
-			pfds[0].events = 0;
+			/* socket, if used, is always in pfds[1] */
+			pfds[1].fd = sock;
+			pfds[1].events = 0;
 			if (wakeEvents & WL_SOCKET_READABLE)
-				pfds[0].events |= POLLIN;
+				pfds[1].events |= POLLIN;
 			if (wakeEvents & WL_SOCKET_WRITEABLE)
-				pfds[0].events |= POLLOUT;
-			pfds[0].revents = 0;
+				pfds[1].events |= POLLOUT;
+			pfds[1].revents = 0;
 			nfds++;
 		}
 
-		pfds[nfds].fd = selfpipe_readfd;
-		pfds[nfds].events = POLLIN;
-		pfds[nfds].revents = 0;
-		nfds++;
-
 		if (wakeEvents & WL_POSTMASTER_DEATH)
 		{
 			/* postmaster fd, if used, is always in pfds[nfds - 1] */
@@ -364,19 +366,26 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 		else
 		{
 			/* at least one event occurred, so check revents values */
+
+			if (pfds[0].revents & POLLIN)
+			{
+				/* There's data in the self-pipe, clear it. */
+				drainSelfPipe();
+			}
+
 			if ((wakeEvents & WL_SOCKET_READABLE) &&
-				(pfds[0].revents & POLLIN))
+				(pfds[1].revents & POLLIN))
 			{
 				/* data available in socket, or EOF/error condition */
 				result |= WL_SOCKET_READABLE;
 			}
 			if ((wakeEvents & WL_SOCKET_WRITEABLE) &&
-				(pfds[0].revents & POLLOUT))
+				(pfds[1].revents & POLLOUT))
 			{
 				/* socket is writable */
 				result |= WL_SOCKET_WRITEABLE;
 			}
-			if (pfds[0].revents & (POLLHUP | POLLERR | POLLNVAL))
+			if (pfds[1].revents & (POLLHUP | POLLERR | POLLNVAL))
 			{
 				/* EOF/error condition */
 				if (wakeEvents & WL_SOCKET_READABLE)
@@ -468,6 +477,11 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 		else
 		{
 			/* at least one event occurred, so check masks */
+			if (FD_ISSET(selfpipe_readfd, &input_mask))
+			{
+				/* There's data in the self-pipe, clear it. */
+				drainSelfPipe();
+			}
 			if ((wakeEvents & WL_SOCKET_READABLE) && FD_ISSET(sock, &input_mask))
 			{
 				/* data available in socket, or EOF */
@@ -498,6 +512,17 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 		}
 #endif   /* LATCH_USE_SELECT */
 
+		/*
+		 * Check again wether latch is set, the arrival of a signal/self-byte
+		 * might be what stopped our sleep. It's not required for correctness
+		 * to signal the latch as being set (we'd just loop if there's no
+		 * other event), but it seems good to report an arrived latch asap.
+		 */
+		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
+		{
+			result |= WL_LATCH_SET;
+		}
+
 		/* If we're not done, update cur_timeout for next iteration */
 		if (result == 0 && (wakeEvents & WL_TIMEOUT))
 		{
-- 
2.5.0.400.gff86faf.dirty

0004-Support-using-epoll-as-the-polling-primitive-in-unix.patch (text/x-patch; charset=us-ascii)
>From fe417866a7132b1ee65e2ed96f79fbaad7922435 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 14 Jan 2016 15:24:15 +0100
Subject: [PATCH 4/4] Support using epoll as the polling primitive in
 unix_latch.c.

epoll(2) has the advantage of being able to reuse the wait datastructure
from previous calls when waiting the next time, on the same
events. Especially when waiting on a socket used by many processes like
the postmaster_alive_fd, that's good for scalability.
---
 configure                     |   2 +-
 configure.in                  |   2 +-
 src/backend/port/unix_latch.c | 228 +++++++++++++++++++++++++++++++++++++++++-
 src/include/pg_config.h.in    |   3 +
 src/include/storage/latch.h   |   4 +
 5 files changed, 234 insertions(+), 5 deletions(-)

diff --git a/configure b/configure
index 3dd1b15..d65e0b4 100755
--- a/configure
+++ b/configure
@@ -10144,7 +10144,7 @@ fi
 ## Header files
 ##
 
-for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
+for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/epoll.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
 do :
   as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh`
 ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default"
diff --git a/configure.in b/configure.in
index 9398482..d24b7e8 100644
--- a/configure.in
+++ b/configure.in
@@ -1163,7 +1163,7 @@ AC_SUBST(UUID_LIBS)
 ##
 
 dnl sys/socket.h is required by AC_FUNC_ACCEPT_ARGTYPES
-AC_CHECK_HEADERS([atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h])
+AC_CHECK_HEADERS([atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/epoll.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h])
 
 # On BSD, test for net/if.h will fail unless sys/socket.h
 # is included first.
diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
index 03bca68..5e0edf6 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -38,6 +38,9 @@
 #include <unistd.h>
 #include <sys/time.h>
 #include <sys/types.h>
+#ifdef HAVE_SYS_EPOLL_H
+#include <sys/epoll.h>
+#endif
 #ifdef HAVE_POLL_H
 #include <poll.h>
 #endif
@@ -62,8 +65,10 @@
  * useful to manually specify the used primitive.  If desired, just add a
  * define somewhere before this block.
  */
-#if defined(LATCH_USE_POLL) || defined(LATCH_USE_SELECT)
+#if defined(LATCH_USE_EPOLL) || defined(LATCH_USE_POLL) || defined(LATCH_USE_SELECT)
 /* don't overwrite manual choice */
+#elif defined(HAVE_SYS_EPOLL_H)
+#define LATCH_USE_EPOLL
 #elif defined(HAVE_POLL)
 #define LATCH_USE_POLL
 #elif HAVE_SYS_SELECT_H
@@ -82,6 +87,9 @@ static int	selfpipe_writefd = -1;
 /* Private function prototypes */
 static void sendSelfPipeByte(void);
 static void drainSelfPipe(void);
+#ifdef LATCH_USE_EPOLL
+static void initEpoll(volatile Latch *latch);
+#endif
 
 
 /*
@@ -127,6 +135,10 @@ InitLatch(volatile Latch *latch)
 	latch->is_set = false;
 	latch->owner_pid = MyProcPid;
 	latch->is_shared = false;
+
+#ifdef LATCH_USE_EPOLL
+	initEpoll(latch);
+#endif
 }
 
 /*
@@ -174,6 +186,10 @@ OwnLatch(volatile Latch *latch)
 		elog(ERROR, "latch already owned");
 
 	latch->owner_pid = MyProcPid;
+
+#ifdef LATCH_USE_EPOLL
+	initEpoll(latch);
+#endif
 }
 
 /*
@@ -186,6 +202,14 @@ DisownLatch(volatile Latch *latch)
 	Assert(latch->owner_pid == MyProcPid);
 
 	latch->owner_pid = 0;
+
+#ifdef LATCH_USE_EPOLL
+	if (latch->epollfd >= 0)
+	{
+		close(latch->epollfd);
+		latch->epollfd = -1;
+	}
+#endif
 }
 
 /*
@@ -231,7 +255,9 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 				cur_time;
 	long		cur_timeout;
 
-#if defined(LATCH_USE_POLL)
+#if defined(LATCH_USE_EPOLL)
+	struct epoll_event events[1];
+#elif defined(LATCH_USE_POLL)
 	struct pollfd pfds[3];
 	int			nfds;
 #elif defined(LATCH_USE_SELECT)
@@ -311,7 +337,175 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 		 * Must wait ... we use the polling interface determined at the top of
 		 * this file to do so.
 		 */
-#if defined(LATCH_USE_POLL)
+#if defined(LATCH_USE_EPOLL)
+		if (wakeEvents != latch->lastmask || latch->lastwatchfd != sock)
+		{
+			bool sockfd_changed = latch->lastwatchfd != sock;
+
+			if (latch->lastwatchfd != -1 && sockfd_changed)
+			{
+				struct epoll_event data;
+
+				/*
+				 * Unnecessarily pass data for delete due to bug errorneously
+				 * requiring it in the past.
+				 */
+				rc = epoll_ctl(latch->epollfd, EPOLL_CTL_DEL,
+							   latch->lastwatchfd, &data);
+				if (rc < 0)
+				{
+					waiting = false;
+					ereport(ERROR,
+							(errcode_for_socket_access(),
+							 errmsg("epoll_ctl() failed: %m")));
+				}
+
+				latch->lastwatchfd = -1;
+			}
+
+			if (sock != -1 && sockfd_changed)
+			{
+				struct epoll_event data;
+				data.events = 0;
+				data.data.fd = sock;
+				rc = epoll_ctl(latch->epollfd, EPOLL_CTL_ADD, sock, &data);
+				if (rc < 0)
+				{
+					waiting = false;
+					ereport(ERROR,
+							(errcode_for_socket_access(),
+							 errmsg("epoll_ctl() failed: %m")));
+				}
+
+				latch->lastwatchfd = sock;
+			}
+
+			if (sock != -1 && (
+					sockfd_changed ||
+					(wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) !=
+					(latch->lastmask & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))))
+			{
+				struct epoll_event data;
+
+				data.events = EPOLLRDHUP | EPOLLERR | EPOLLHUP;
+				data.data.fd = sock;
+
+				if (wakeEvents & WL_SOCKET_READABLE)
+					data.events |= EPOLLIN;
+				if (wakeEvents & WL_SOCKET_WRITEABLE)
+					data.events |= EPOLLOUT;
+
+				rc = epoll_ctl(latch->epollfd, EPOLL_CTL_MOD, sock, &data);
+				if (rc < 0)
+				{
+					waiting = false;
+					ereport(ERROR,
+							(errcode_for_socket_access(),
+							 errmsg("epoll_ctl() failed: %m")));
+				}
+			}
+
+			if ((latch->lastmask & WL_POSTMASTER_DEATH) &&
+				!(wakeEvents & WL_POSTMASTER_DEATH))
+			{
+				struct epoll_event data;
+
+				/*
+				 * Unnecessarily pass data for delete due to bug errorneously
+				 * requiring it in the past.
+				 */
+				rc = epoll_ctl(latch->epollfd, EPOLL_CTL_DEL,
+							   postmaster_alive_fds[POSTMASTER_FD_WATCH],
+							   &data);
+				if (rc < 0)
+				{
+					waiting = false;
+					ereport(ERROR,
+							(errcode_for_socket_access(),
+							 errmsg("epoll_ctl() failed: %m")));
+				}
+			}
+
+
+			if (!(latch->lastmask & WL_POSTMASTER_DEATH) &&
+				(wakeEvents & WL_POSTMASTER_DEATH))
+			{
+				struct epoll_event data;
+
+				data.events = EPOLLIN | EPOLLHUP | EPOLLRDHUP | EPOLLERR;
+				data.data.fd = postmaster_alive_fds[POSTMASTER_FD_WATCH];
+
+				rc = epoll_ctl(latch->epollfd, EPOLL_CTL_ADD,
+							   postmaster_alive_fds[POSTMASTER_FD_WATCH],
+							   &data);
+				if (rc < 0)
+				{
+					waiting = false;
+					ereport(ERROR,
+							(errcode_for_socket_access(),
+							 errmsg("epoll_ctl() failed: %m")));
+				}
+			}
+
+			latch->lastmask = wakeEvents;
+		}
+
+		rc = epoll_wait(latch->epollfd, events, 1, cur_timeout);
+		if (rc < 0)
+		{
+			/* EINTR is okay, otherwise complain */
+			if (errno != EINTR)
+			{
+				waiting = false;
+				ereport(ERROR,
+						(errcode_for_socket_access(),
+						 errmsg("epoll_wait() failed: %m")));
+			}
+		}
+		else if (rc == 0)
+		{
+			/* timeout exceeded */
+			if (wakeEvents & WL_TIMEOUT)
+				result |= WL_TIMEOUT;
+		}
+		else
+		{
+			if (events[0].data.fd == sock)
+			{
+				/* data available in socket */
+				if (events[0].events & EPOLLIN)
+					result |= WL_SOCKET_READABLE;
+
+				/* socket is writable */
+				if (events[0].events & EPOLLOUT)
+					result |= WL_SOCKET_WRITEABLE;
+
+				/* EOF/error condition */
+				if (events[0].events & (EPOLLERR | EPOLLHUP | EPOLLRDHUP))
+				{
+					if (wakeEvents & WL_SOCKET_READABLE)
+						result |= WL_SOCKET_READABLE;
+					if (wakeEvents & WL_SOCKET_WRITEABLE)
+						result |= WL_SOCKET_WRITEABLE;
+				}
+			}
+
+			if (events[0].data.fd == postmaster_alive_fds[POSTMASTER_FD_WATCH] &&
+				events[0].events & (EPOLLIN | EPOLLHUP | EPOLLERR | EPOLLRDHUP))
+			{
+				/* check comment for the corresponding LATCH_USE_POLL case */
+				Assert(!PostmasterIsAlive());
+				result |= WL_POSTMASTER_DEATH;
+			}
+
+			if (events[0].data.fd == selfpipe_readfd &&
+				events[0].events & EPOLLIN)
+			{
+				/* There's data in the self-pipe, clear it. */
+				drainSelfPipe();
+			}
+		}
+#elif defined(LATCH_USE_POLL)
 		nfds = 0;
 
 		/* selfpipe is always in pfds[0] */
@@ -725,3 +919,31 @@ drainSelfPipe(void)
 		/* else buffer wasn't big enough, so read again */
 	}
 }
+
+#ifdef LATCH_USE_EPOLL
+/*
+ * Create the epoll fd used to wait for readiness. Needs to be called whenever
+ * owning a latch, be it a shared or a backend-local one.
+ */
+static void
+initEpoll(volatile Latch *latch)
+{
+	struct epoll_event data;
+	int rc;
+
+	/* one each for selfpipe, socket, postmaster alive fd */
+	latch->epollfd = epoll_create(3);
+	if (latch->epollfd < 0)
+		elog(FATAL, "epoll_create failed: %m");
+
+	/* always want to be nodified of writes into thee self-pipe */
+	data.events = EPOLLIN;
+	data.data.fd = selfpipe_readfd;
+	rc = epoll_ctl(latch->epollfd, EPOLL_CTL_ADD, selfpipe_readfd, &data);
+	if (rc < 0)
+		elog(FATAL, "epoll_ctl failed: %m");
+
+	latch->lastwatchfd = -1;
+	latch->lastmask = 0;
+}
+#endif
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 16a272e..0fc4ce2 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -530,6 +530,9 @@
 /* Define to 1 if you have the syslog interface. */
 #undef HAVE_SYSLOG
 
+/* Define to 1 if you have the <sys/epoll.h> header file. */
+#undef HAVE_SYS_EPOLL_H
+
 /* Define to 1 if you have the <sys/ioctl.h> header file. */
 #undef HAVE_SYS_IOCTL_H
 
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index e77491e..3666352 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -92,6 +92,10 @@ typedef struct Latch
 	int			owner_pid;
 #ifdef WIN32
 	HANDLE		event;
+#elif defined(HAVE_SYS_EPOLL_H)
+	int			epollfd;
+	int			lastwatchfd;
+	int			lastmask;
 #endif
 } Latch;
 
-- 
2.5.0.400.gff86faf.dirty

#19 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#18)
Re: Performance degradation in commit ac1d794

Andres Freund <andres@anarazel.de> writes:

0004 currently contains one debatable optimization, which I'd like to
discuss: Currently the 'sock' passed to WaitLatchOrSocket is not
removed/added to the epoll fd, if it's the numerically same as in the
last call. That's good for performance, but would be wrong if the socket
were close and a new one with the same value would be waited on. I
think a big warning sign somewhere is sufficient to deal with that
problem - it's not something we're likely to start doing. And even if
it's done at some point, we can just offer an API to reset the last used
socket fd.

Perhaps a cleaner API solution would be to remove the socket argument per
se from the function altogether, instead providing a separate
SetSocketToWaitOn() call.

(Also, if there is a need for it, we could provide a function that still
takes a socket argument, with the understanding that it's to be used for
short-lived sockets where you don't want to change the process's main
epoll state.)

regards, tom lane


#20 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#18)
Re: Performance degradation in commit ac1d794

On Thu, Jan 14, 2016 at 9:39 AM, Andres Freund <andres@anarazel.de> wrote:

I finally got back to working on this. Attached is a WIP patch series
implementing:
0001: Allow to easily choose between the readiness primitives in unix_latch.c
Pretty helpful for testing, not useful for anything else.

Looks good.

0002: Error out if waiting on socket readiness without a specified socket.

Looks good.

0003: Only clear unix_latch.c's self-pipe if it actually contains data.
~2% on high qps workloads

everytime -> every time

+        if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
+        {
+            result |= WL_LATCH_SET;
+        }

Excess braces.

Doesn't this code make it possible for the self-pipe to fill up,
self-deadlocking the process? Suppose we repeatedly enter
WaitLatchOrSocket(). Each time we do, just after waiting = true is
set, somebody sets the latch. We handle the signal and put a byte
into the pipe. Returning from the signal handler, we then notice that
is_set is true and return at once, without draining the pipe. Repeat
until something bad happens.

0004: Support using epoll as the polling primitive in unix_latch.c.
~3% on high qps workloads, massive scalability improvements (x3)
on very large machines.

With 0004 obviously being the relevant bit for this thread. I verified
that using epoll addresses the performance problem, using the hardware
the OP noticed the performance problem on.

+                /*
+                 * Unnecessarily pass data for delete due to bug errorneously
+                 * requiring it in the past.
+                 */

This is pretty vague. And it has a spelling mistake.

Further down, nodified -> notified.

+ if (wakeEvents != latch->lastmask || latch->lastwatchfd != sock)

I don't like this very much. I think it's a bad idea to test
latch->lastwatchfd != sock. That has an excellent chance of letting
people write code that appears to work but then doesn't. I think it
would be better, if we're going to change the API contract, to make it
a hard break, as I see Tom has also suggested while I've been writing
this.

Incidentally, if we're going to whack around the latch API, it would
be nice to pick a design which wouldn't be too hard to extend to
waiting on multiple sockets. The application I have in mind is to
send off queries to several foreign servers at once and then wait until
bytes come back from any of them. It's mostly pie in the sky at this
point, but it seems highly likely to me that we'd want to do such a
thing by waiting for bytes from any of the sockets involved OR a latch
event.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#21 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#20)
Re: Performance degradation in commit ac1d794

Robert Haas <robertmhaas@gmail.com> writes:

Incidentally, if we're going to whack around the latch API, it would
be nice to pick a design which wouldn't be too hard to extend to
waiting on multiple sockets. The application I have in mind is to
send of queries to several foreign servers at once and then wait until
bytes come back from any of them. It's mostly pie in the sky at this
point, but it seems highly likely to me that we'd want to do such a
thing by waiting for bytes from any of the sockets involved OR a latch
event.

Instead of SetSocketToWaitOn, maybe AddSocketToWaitSet and
RemoveSocketFromWaitSet? And you'd need some way of identifying
which socket came ready after a wait call...

regards, tom lane
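For illustration, an interface along those lines might look like the declarations below; every name and signature here is hypothetical and not part of any patch in this thread, but it captures both points: sockets are added and removed explicitly, and the wait reports which registered socket became ready.

/* Hypothetical API shape only; assumes the Latch/pgsocket types from latch.h. */
typedef struct WaitSet WaitSet;    /* opaque; would wrap the epoll fd / HANDLEs */

extern WaitSet *CreateWaitSet(volatile Latch *latch, int nsockets_hint);
extern int  AddSocketToWaitSet(WaitSet *set, pgsocket sock, int events);
extern void RemoveSocketFromWaitSet(WaitSet *set, int position);

/*
 * Returns a WL_* event mask.  If a socket became ready, *ready_position is
 * set to the value AddSocketToWaitSet() returned for that socket, so the
 * caller can tell which of several registered sockets woke it up.
 */
extern int  WaitOnSet(WaitSet *set, int wakeEvents, long timeout,
                      int *ready_position);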


#22 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#20)
Re: Performance degradation in commit ac1d794

On 2016-01-14 10:39:55 -0500, Robert Haas wrote:

Doesn't this code make it possible for the self-pipe to fill up,
self-deadlocking the process? Suppose we repeatedly enter
WaitLatchOrSocket(). Each time we do, just after waiting = true is
set, somebody sets the latch. We handle the signal and put a byte
into the pipe. Returning from the signal handler, we then notice that
is_set is true and return at once, without draining the pipe. Repeat
until something bad happens.

Should be fine because the self-pipe is marked as non-blocking
if (fcntl(pipefd[1], F_SETFL, O_NONBLOCK) < 0)
elog(FATAL, "fcntl() failed on write-end of self-pipe: %m");
and sendSelfPipeByte accepts the blocking case as success

/*
* If the pipe is full, we don't need to retry, the data that's there
* already is enough to wake up WaitLatch.
*/
if (errno == EAGAIN || errno == EWOULDBLOCK)
return;

0004: Support using epoll as the polling primitive in unix_latch.c.
~3% on high qps workloads, massive scalability improvements (x3)
on very large machines.

With 0004 obviously being the relevant bit for this thread. I verified
that using epoll addresses the performance problem, using the hardware
the OP noticed the performance problem on.

+                /*
+                 * Unnecessarily pass data for delete due to bug errorneously
+                 * requiring it in the past.
+                 */

This is pretty vague. And it has a spelling mistake.

Will add a reference to the manpage (where that requirement is coming
from).

Further down, nodified -> notified.

+ if (wakeEvents != latch->lastmask || latch->lastwatchfd != sock)

I don't like this very much.

Yea, me neither, which is why I called it out... I think it's not too
likely to cause problems in practice though. But I think changing the
API makes sense, so the likelihood shouldn't be a relevant issue.

Incidentally, if we're going to whack around the latch API, it would
be nice to pick a design which wouldn't be too hard to extend to
waiting on multiple sockets.

Hm. That seems likely to make usage harder for users of the API. So it
seems like it'd make sense to provide a simpler version anyway, for the
majority of users.

So, I'm wondering how exactly we'd use a hypothetical
SetSocketToWaitOn, or SetSocketsToWaitOn (or whatever). I mean it can
make a fair bit of sense to sometimes wait on MyLatch/port->sock and
sometimes on MyLatch/fdw connections. The simple proposed code would
change the epoll set whenever switching between both, but with
SetSocketsToWaitOn you'd probably end up switching this much more often?

One way to address that would be to create a 'latch wait' datastructure,
that'd then contain the epoll fd/win32 wait events/... That way you
could have one 'LatchWait' for latch + client socket and one for latch +
fdw sockets.

Greetings,

Andres Freund
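A rough sketch of what such a 'latch wait' structure might hold, reusing the LATCH_USE_* symbols from the 0004 patch; the field layout is purely illustrative. The kernel-side state lives in the structure, so a backend could keep one long-lived object for latch + client socket and another for latch + FDW sockets without re-registering fds on every wait:

/* Illustrative only; not part of any patch in this thread. */
typedef struct LatchWait
{
    volatile Latch *latch;      /* latch included in every wait */
#if defined(LATCH_USE_EPOLL)
    int         epollfd;        /* persistent epoll instance */
#elif defined(WIN32)
    HANDLE     *events;         /* handles for WaitForMultipleObjects() */
    int         nevents;
#else
    struct pollfd *pfds;        /* pollfd array reused across waits */
    int         npfds;
#endif
    int         wakeEvents;     /* WL_* mask this wait set was built for */
} LatchWait;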


#23 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#21)
Re: Performance degradation in commit ac1d794

On Thu, Jan 14, 2016 at 10:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Incidentally, if we're going to whack around the latch API, it would
be nice to pick a design which wouldn't be too hard to extend to
waiting on multiple sockets. The application I have in mind is to
send off queries to several foreign servers at once and then wait until
bytes come back from any of them. It's mostly pie in the sky at this
point, but it seems highly likely to me that we'd want to do such a
thing by waiting for bytes from any of the sockets involved OR a latch
event.

Instead of SetSocketToWaitOn, maybe AddSocketToWaitSet and
RemoveSocketFromWaitSet? And you'd need some way of identifying
which socket came ready after a wait call...

Yeah. Although I think for now it would be fine to just error out if
somebody tries to add a socket and there already is one. Then we
could lift that limitation in a later commit. Of course if Andres
wants to do the whole thing now I'm not going to get in the way, but
since that will require Windows tinkering and so on it may be more
than he wants to dive into.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#24Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#23)
Re: Performance degradation in commit ac1d794

On January 14, 2016 5:16:59 PM GMT+01:00, Robert Haas <robertmhaas@gmail.com> wrote:

Yeah. Although I think for now it would be fine to just error out if
somebody tries to add a socket and there already is one. Then we
could lift that limitation in a later commit. Of course if Andres
wants to do the whole thing now I'm not going to get in the way, but
since that will require Windows tinkering and so on it may be more
than he wants to dive into.

Yea, I don't want to do anything really large at the moment. My primary interest is fixing the major performance regression.

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.


#25Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#22)
Re: Performance degradation in commit ac1d794

On Thu, Jan 14, 2016 at 10:56 AM, Andres Freund <andres@anarazel.de> wrote:

So, I'm wondering how exactly we'd use a hypothetical
SetSocketToWaitOn, or SetSocketsToWaitOn (or whatever). I mean it can
make a fair bit of sense to sometimes wait on MyLatch/port->sock and
sometimes on MyLatch/fdw connections. The simple proposed code would
change the epoll set whenever switching between both, but with
SetSocketsToWaitOn you'd probably end up switching this much more often?

One way to address that would be to create a 'latch wait' datastructure,
that'd then contain the epoll fd/win32 wait events/... That way you
could have one 'LatchWait' for latch + client socket and one for latch +
fdw sockets.

I see your point. As far as I can see, it's currently true that,
right now, the only places where we wait for a socket are places where
the socket will live for the lifetime of the backend, but I think we
should regard it as likely that, in the future, we'll want to use it
anywhere we want to wait for a socket to become ready. There are
getting to be a lot of places where we need to unstick some loop
whenever the process latch gets set, and it seems likely to me that
needs will only continue to grow. So the API should probably
contemplate that sort of need.

I think your idea of a data structure that encapsulates a set of events
for which to wait is probably a good one. WaitLatch doesn't seem like
a great name. Maybe WaitEventSet, and then we can have
WaitLatch(&latch) and WaitEvents(&eventset).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#26Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#25)
Re: Performance degradation in commit ac1d794

On 2016-01-14 11:31:03 -0500, Robert Haas wrote:

On Thu, Jan 14, 2016 at 10:56 AM, Andres Freund <andres@anarazel.de> wrote:
I think your idea of a data structure that encapsulates a set of events
for which to wait is probably a good one. WaitLatch doesn't seem like
a great name. Maybe WaitEventSet, and then we can have
WaitLatch(&latch) and WaitEvents(&eventset).

Hm, I'd like to have latch in the name. It seems far from improbable to
have another wait data structure. LatchEventSet maybe? The wait would be
implied by WaitLatch.

So effectively we'd create a LatchEventSet feLatchSet; somewhere global
(and update it from a backend local to the proc latch in
SwitchToSharedLatch/SwitchBackToLocalLatch()). Then change all WaitLatch
calls to refer to those.

Do we want to provide a backward compatible API for all this? I'm fine
either way.

Andres


#27Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#26)
Re: Performance degradation in commit ac1d794

On Thu, Jan 14, 2016 at 12:06 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-01-14 11:31:03 -0500, Robert Haas wrote:

On Thu, Jan 14, 2016 at 10:56 AM, Andres Freund <andres@anarazel.de> wrote:
I think your idea of a data structure that encapsulates a set of events
for which to wait is probably a good one. WaitLatch doesn't seem like
a great name. Maybe WaitEventSet, and then we can have
WaitLatch(&latch) and WaitEvents(&eventset).

Hm, I'd like to have latch in the name. It seems far from improbable to
have another wait data structure. LatchEventSet maybe? The wait would be
implied by WaitLatch.

I can live with that.

So effectively we'd create a LatchEventSet feLatchSet; somewhere global
(and update it from a backend local to the proc latch in
SwitchToSharedLatch/SwitchBackToLocalLatch()). Then change all WaitLatch
calls to refer to those.

Sure.

Do we want to provide a backward compatible API for all this? I'm fine
either way.

How would that work?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#27)
Re: Performance degradation in commit ac1d794

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Jan 14, 2016 at 12:06 PM, Andres Freund <andres@anarazel.de> wrote:

Do we want to provide a backward compatible API for all this? I'm fine
either way.

How would that work?

I see no great need to be backwards-compatible on this, especially if it
would complicate matters at all. I doubt there's a lot of third-party
code using WaitLatch right now. Just make sure there's an obvious
compile failure for anyone who is.

regards, tom lane


#29Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#27)
Re: Performance degradation in commit ac1d794

On 2016-01-14 12:07:23 -0500, Robert Haas wrote:

Do we want to provide a backward compatible API for all this? I'm fine
either way.

How would that work?

I'm thinking of something like;

int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);

int
WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,long timeout)
{
LatchEventSet set;

LatchEventSetInit(&set, latch);

if (sock != PGINVALID_SOCKET)
LatchEventSetAddSock(&set, sock);

return WaitOnLatchSet(set, wakeEvents, timeout);
}

I think we'll need to continue having wakeEvents and timeout parameters
for WaitOnLatchSet; we quite frequently want to wait for socket
readability/writability, or not wait on the socket at all, and have/not have
timeouts.


#30Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#29)
Re: Performance degradation in commit ac1d794

On 2016-01-14 18:14:21 +0100, Andres Freund wrote:

I'm thinking of something like;

int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);

int
WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,long timeout)
{
LatchEventSet set;

LatchEventSetInit(&set, latch);

if (sock != PGINVALID_SOCKET)
LatchEventSetAddSock(&set, sock);

return WaitOnLatchSet(set, wakeEvents, timeout);
}

I think we'll need to continue having wakeEvents and timeout parameters
for WaitOnLatchSet; we quite frequently want to wait for socket
readability/writability, or not wait on the socket at all, and have/not have
timeouts.

This brings me to something related: I'm wondering if we shouldn't merge
unix/win32_latch.c. If we go this route it seems like the amount of
shared infrastructure will further increase. The difference between
win32 and, say, the select code isn't much bigger than the difference
between select/poll. epoll/win32 are probably more similar than that
actually.

Andres


#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#30)
Re: Performance degradation in commit ac1d794

Andres Freund <andres@anarazel.de> writes:

This brings me to something related: I'm wondering if we shouldn't merge
unix/win32_latch.c.

Well, it's duplicated code on the one hand versus maze-of-ifdefs on the
other. Feel free to try it and see, but I'm unsure it'd be an improvement.

regards, tom lane


#32Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#29)
Re: Performance degradation in commit ac1d794

On 2016-01-14 18:14:21 +0100, Andres Freund wrote:

On 2016-01-14 12:07:23 -0500, Robert Haas wrote:

Do we want to provide a backward compatible API for all this? I'm fine
either way.

How would that work?

I'm thinking of something like;

int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);

int
WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,long timeout)
{
LatchEventSet set;

LatchEventSetInit(&set, latch);

if (sock != PGINVALID_SOCKET)
LatchEventSetAddSock(&set, sock);

return WaitOnLatchSet(set, wakeEvents, timeout);
}

I think we'll need to continue having wakeEvents and timeout parameters
for WaitOnLatchSet; we quite frequently want to wait for socket
readability/writability, or not wait on the socket at all, and have/not have
timeouts.

Hm. If we really want to support multiple sockets at some point, the
above WaitOnLatchSet signature isn't going to fly, because it won't
support figuring out which fd the event triggered on.

So it seems we'd need to return something like
struct LatchEvent
{
enum LatchEventType {WL_LATCH_EVENT, WL_TIMEOUT, WL_SOCKET_EVENT, WL_POSTMASTER_EVENT, ...} event_type;
int mask;
pgsocket event_sock;
};

that'd also allow us to extend this to return multiple events if we want
that at some point. Alternatively we could add a pgsocket* argument, but
that doesn't really seem much better.

Not super happy about the above proposal.

Andres


#33Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#29)
Re: Performance degradation in commit ac1d794

On Thu, Jan 14, 2016 at 12:14 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-01-14 12:07:23 -0500, Robert Haas wrote:

Do we want to provide a backward compatible API for all this? I'm fine
either way.

How would that work?

I'm thinking of something like;

int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);

int
WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,long timeout)
{
LatchEventSet set;

LatchEventSetInit(&set, latch);

if (sock != PGINVALID_SOCKET)
LatchEventSetAddSock(&set, sock);

return WaitOnLatchSet(set, wakeEvents, timeout);
}

I think we'll need to continue having wakeEvents and timeout parameters
for WaitOnLatchSet; we quite frequently want to wait for socket
readability/writability, or not wait on the socket at all, and have/not have
timeouts.

Well, if we ever wanted to support multiple FDs, we'd need the
readability/writeability thing to be per-fd, not per-set.

Overall, if this is what you have in mind for backward compatibility,
I rate it M for Meh. Let's just break compatibility and people will
have to update their code. That shouldn't be hard, and if we don't
make people do it when we make the change, then we'll be stuck with
the backward-compatibility interface for a decade. I doubt it's worth
it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#34Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Andres Freund (#32)
Re: Performance degradation in commit ac1d794

Hello, I am one who wants to wait on many sockets at once.

At Thu, 14 Jan 2016 18:55:51 +0100, Andres Freund <andres@anarazel.de> wrote in <20160114175551.GM10941@awork2.anarazel.de>

On 2016-01-14 18:14:21 +0100, Andres Freund wrote:

On 2016-01-14 12:07:23 -0500, Robert Haas wrote:

Do we want to provide a backward compatible API for all this? I'm fine
either way.

How would that work?

I'm thinking of something like;

int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);

int
WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,long timeout)
{
LatchEventSet set;

LatchEventSetInit(&set, latch);

if (sock != PGINVALID_SOCKET)
LatchEventSetAddSock(&set, sock);

return WaitOnLatchSet(set, wakeEvents, timeout);
}

I think we'll need to continue having wakeEvents and timeout parameters
for WaitOnLatchSet; we quite frequently want to wait for socket
readability/writability, or not wait on the socket at all, and have/not have
timeouts.

Hm. If we really want to support multiple sockets at some point, the
above WaitOnLatchSet signature isn't going to fly, because it won't
support figuring out which fd the event triggered on.

So it seems we'd need to return something like
struct LatchEvent
{
enum LatchEventType {WL_LATCH_EVENT, WL_TIMEOUT, WL_SOCKET_EVENT, WL_POSTMASTER_EVENT, ...} event_type;
int mask;
pgsocket event_sock;
};

that'd also allow us to extend this to return multiple events if we want
that at some point. Alternatively we could add a pgsocket* argument, but
that doesn't really seem much better.

Not super happy about the above proposal.

How about allowing registration of a callback for every waiting
socket? The signature of the callback function would be like

enum LATCH_CALLBACK_STATE
LatchWaitCallback(pgsocket event_sock,
enum LatchEventType, int mask?, void *bogus);

It can return, for instance, LCB_CONTINUE, LCB_BREAK or
LCB_IMMEDBREAK, and if any one of them returns LCB_BREAK, the wait
loop breaks after (perhaps) calling all callbacks for the fired events.

We could have predefined callbacks for every event type that do
nothing but set a corresponding flag and return LCB_BREAK.

/* Waiting set has been constructed so far */
if (!WaitOnLatchSet(&set))   /* (?) */
    error();

if (is_sock_readable[sockid]) {} /* is_* flags would be global */
if (is_sock_writable[sockid]) {} /* is_* flags would be global */
/* Any other types of trigger would be processed elsewhere */

Although it might be slow if we have an enormous number of
sockets fired at once, I suppose it returns only for a few
sockets in most cases.
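
As a generic, standalone illustration of that dispatch style (names are
made up; this is not the proposed patch): a wait routine can poll() once
and then invoke the registered callback for each entry whose revents
fired, stopping early when a callback asks to break.

#include <poll.h>
#include <stdio.h>
#include <unistd.h>

typedef enum { LCB_CONTINUE, LCB_BREAK } cb_result;

/* Callback invoked for each fd that poll() reports as fired. */
typedef cb_result (*wait_callback) (int fd, short revents, void *arg);

typedef struct
{
    int     fd;
    short   events;
    wait_callback cb;
    void   *arg;
} wait_entry;

static cb_result
on_stdin_readable(int fd, short revents, void *arg)
{
    char    buf[128];
    ssize_t n = read(fd, buf, sizeof(buf) - 1);

    (void) revents;
    (void) arg;
    if (n > 0)
    {
        buf[n] = '\0';
        printf("read: %s", buf);
    }
    return LCB_BREAK;           /* one event is enough for this demo */
}

/* poll() once and dispatch callbacks for every fired entry. */
static void
wait_and_dispatch(wait_entry *entries, int n, int timeout_ms)
{
    struct pollfd pfds[8];      /* demo only: supports up to 8 entries */
    int     i;

    if (n > 8)
        return;

    for (i = 0; i < n; i++)
    {
        pfds[i].fd = entries[i].fd;
        pfds[i].events = entries[i].events;
        pfds[i].revents = 0;
    }

    if (poll(pfds, n, timeout_ms) <= 0)
        return;                 /* timeout or error: nothing to dispatch */

    for (i = 0; i < n; i++)
    {
        if (pfds[i].revents == 0)
            continue;
        if (entries[i].cb(entries[i].fd, pfds[i].revents,
                          entries[i].arg) == LCB_BREAK)
            break;
    }
}

int
main(void)
{
    wait_entry entries[] = {
        {STDIN_FILENO, POLLIN, on_stdin_readable, NULL},
    };

    /* Wait up to five seconds for something to arrive on stdin. */
    wait_and_dispatch(entries, 1, 5000);
    return 0;
}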

I don't see the big picture of the whole latch/signalling
mechanism with this, but callbacks might be usable for signalling
on many sockets.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#35Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#31)
Re: Performance degradation in commit ac1d794

On Thu, Jan 14, 2016 at 12:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@anarazel.de> writes:

This brings me to something related: I'm wondering if we shouldn't merge
unix/win32_latch.c.

Well, it's duplicated code on the one hand versus maze-of-ifdefs on the
other. Feel free to try it and see, but I'm unsure it'd be an improvement.

I think we should either get this fixed RSN or revert the problematic
commit until we get it fixed. I'd be rather disappointed about the
latter because I think this was a very good thing on the merits, but
probably not good enough to justify taking the performance hit over
the long term.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#35)
Re: Performance degradation in commit ac1d794

Robert Haas <robertmhaas@gmail.com> writes:

I think we should either get this fixed RSN or revert the problematic
commit until we get it fixed. I'd be rather disappointed about the
latter because I think this was a very good thing on the merits, but
probably not good enough to justify taking the performance hit over
the long term.

Since it's only in HEAD, I'm not seeing the urgency of reverting it.
However, it'd be a good idea to put this on the 9.6 open items list
(have we got such a page yet?) to make sure it gets addressed before
beta.

regards, tom lane


#37Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#36)
Re: Performance degradation in commit ac1d794

On Thu, Feb 11, 2016 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I think we should either get this fixed RSN or revert the problematic
commit until we get it fixed. I'd be rather disappointed about the
latter because I think this was a very good thing on the merits, but
probably not good enough to justify taking the performance hit over
the long term.

Since it's only in HEAD, I'm not seeing the urgency of reverting it.
However, it'd be a good idea to put this on the 9.6 open items list
(have we got such a page yet?) to make sure it gets addressed before
beta.

One problem is that it makes for misleading results if you try to
benchmark 9.5 against 9.6.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#38Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#37)
Re: Performance degradation in commit ac1d794

On 2016-02-11 12:50:58 -0500, Robert Haas wrote:

On Thu, Feb 11, 2016 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I think we should either get this fixed RSN or revert the problematic
commit until we get it fixed. I'd be rather disappointed about the
latter because I think this was a very good thing on the merits, but
probably not good enough to justify taking the performance hit over
the long term.

Since it's only in HEAD, I'm not seeing the urgency of reverting it.
However, it'd be a good idea to put this on the 9.6 open items list
(have we got such a page yet?) to make sure it gets addressed before
beta.

One problem is that it makes for misleading results if you try to
benchmark 9.5 against 9.6.

You need a really beefy box to show the problem. On a large/new 2 socket
machine the performance regression is in the 1-3% range for a pgbench of
SELECT 1. So it's not like it's immediately showing up for everyone.

Putting it on the open items list sounds good to me.

Regards,

Andres


#39Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#38)
Re: Performance degradation in commit ac1d794

On Thu, Feb 11, 2016 at 12:53 PM, Andres Freund <andres@anarazel.de> wrote:

One problem is that it makes for misleading results if you try to
benchmark 9.5 against 9.6.

You need a really beefy box to show the problem. On a large/new 2 socket
machine the performance regression is in the 1-3% range for a pgbench of
SELECT 1. So it's not like it's immediately showing up for everyone.

Putting it on the open items list sounds good to me.

Well, OK, I've done that then. I don't really agree that it's not a
problem; the OP said he saw a 3x regression, and some of my colleagues
doing benchmarking are complaining about this commit, too. It doesn't
seem like much of a stretch to think that it might be affecting other
people as well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#40Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#39)
Re: Performance degradation in commit ac1d794

On 2016-02-11 13:09:27 -0500, Robert Haas wrote:

On Thu, Feb 11, 2016 at 12:53 PM, Andres Freund <andres@anarazel.de> wrote:

One problem is that it makes for misleading results if you try to
benchmark 9.5 against 9.6.

You need a really beefy box to show the problem. On a large/new 2 socket
machine the performance regression is in the 1-3% range for a pgbench of
SELECT 1. So it's not like it's immediately showing up for everyone.

Putting it on the open items list sounds good to me.

Well, OK, I've done that then. I don't really agree that it's not a
problem; the OP said he saw a 3x regression, and some of my colleagues
doing benchmarking are complaining about this commit, too. It doesn't
seem like much of a stretch to think that it might be affecting other
people as well.

Well, I can't do anything about that right now. I won't have the time to
whip up the new/more complex API we discussed upthread in the next few
days. So either we go with a simpler API (e.g. pretty much a cleaned up
version of my earlier patch), revert the postmaster death check, or
somebody else has to take the lead in renovating, or we wait...

Andres


#41Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#40)
Re: Performance degradation in commit ac1d794

On Thu, Feb 11, 2016 at 1:19 PM, Andres Freund <andres@anarazel.de> wrote:

Putting it on the open items list sounds good to me.

Well, OK, I've done that then. I don't really agree that it's not a
problem; the OP said he saw a 3x regression, and some of my colleagues
doing benchmarking are complaining about this commit, too. It doesn't
seem like much of a stretch to think that it might be affecting other
people as well.

Well, I can't do anything about that right now. I won't have the time to
whip up the new/more complex API we discussed upthread in the next few
days. So either we go with a simpler API (e.g. pretty much a cleaned up
version of my earlier patch), revert the postmaster death check, or
somebody else has to take the lead in renovating, or we wait...

Well, I thought we could just revert the patch until you had time to
deal with it, and then put it back in. That seemed like a simple and
practical option from here, and I don't think I quite understand why
you and Tom don't like it. I don't have a problem with deferring to
the majority will here, but I would sort of like to understand the
reason for the majority will.

BTW, if need be, I can look for an EnterpriseDB resource to work on
this. It won't likely be me, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#41)
Re: Performance degradation in commit ac1d794

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Feb 11, 2016 at 1:19 PM, Andres Freund <andres@anarazel.de> wrote:

Well, I can't do anything about that right now. I won't have the time to
whip up the new/more complex API we discussed upthread in the next few
days. So either we go with a simpler API (e.g. pretty much a cleaned up
version of my earlier patch), revert the postmaster death check, or
somebody else has to take the lead in renovating, or we wait...

Well, I thought we could just revert the patch until you had time to
deal with it, and then put it back in. That seemed like a simple and
practical option from here, and I don't think I quite understand why
you and Tom don't like it.

Don't particularly want the git history churn, if we expect that the
patch will ship as-committed in 9.6. If it becomes clear that the
performance fix is unlikely to happen, we can revert then.

If the performance change were an issue for a lot of testing, I'd agree
with a temporary revert, but I concur with Andres that it's not blocking
much. Anybody who does have an issue there can revert locally, no?

regards, tom lane


#43Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#42)
Re: Performance degradation in commit ac1d794

On Thu, Feb 11, 2016 at 1:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Feb 11, 2016 at 1:19 PM, Andres Freund <andres@anarazel.de> wrote:

Well, I can't do anything about that right now. I won't have the time to
whip up the new/more complex API we discussed upthread in the next few
days. So either we go with a simpler API (e.g. pretty much a cleaned up
version of my earlier patch), revert the postmaster death check, or
somebody else has to take the lead in renovating, or we wait...

Well, I thought we could just revert the patch until you had time to
deal with it, and then put it back in. That seemed like a simple and
practical option from here, and I don't think I quite understand why
you and Tom don't like it.

Don't particularly want the git history churn, if we expect that the
patch will ship as-committed in 9.6. If it becomes clear that the
performance fix is unlikely to happen, we can revert then.

If the performance change were an issue for a lot of testing, I'd agree
with a temporary revert, but I concur with Andres that it's not blocking
much. Anybody who does have an issue there can revert locally, no?

True. Maybe we'll just have to start doing that for EnterpriseDB
benchmarking as standard practice. Not sure everybody who is
benchmarking will realize the issue though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#44Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Robert Haas (#33)
1 attachment(s)
[PoC] WaitLatchOrSocketMulti (Re: Performance degradation in commit ac1d794)

Hello, I don't see how ac1d794 will be dealt with, but I tried an
example implementation of a multi-socket version of WaitLatchOrSocket
using callbacks, on top of the current master where ac1d794 has
not been removed yet.

At Thu, 14 Jan 2016 13:46:44 -0500, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoYBa8TJRGS07JCSLKpqGkrRd5hLpirvwp36s=83ChmQDA@mail.gmail.com>

On Thu, Jan 14, 2016 at 12:14 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-01-14 12:07:23 -0500, Robert Haas wrote:

Do we want to provide a backward compatible API for all this? I'm fine
either way.

How would that work?

I'm thinking of something like;

int WaitOnLatchSet(LatchEventSet *set, int wakeEvents, long timeout);

int
WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,long timeout)
{
LatchEventSet set;

LatchEventSetInit(&set, latch);

if (sock != PGINVALID_SOCKET)
LatchEventSetAddSock(&set, sock);

return WaitOnLatchSet(set, wakeEvents, timeout);
}

I think we'll need to continue having wakeEvents and timeout parameters
for WaitOnLatchSet; we quite frequently want to wait for socket
readability/writability, or not wait on the socket at all, and have/not have
timeouts.

Well, if we ever wanted to support multiple FDs, we'd need the
readability/writeability thing to be per-fd, not per-set.

Overall, if this is what you have in mind for backward compatibility,
I rate it M for Meh. Let's just break compatibility and people will
have to update their code. That shouldn't be hard, and if we don't
make people do it when we make the change, then we'll be stuck with
the backward-compatibility interface for a decade. I doubt it's worth
it.

The API is similar to what Robert suggested, but different because
that would be a bit too complicated for most cases. So this example
implementation takes an intermediate style between the current API and
Robert's suggestion, using callbacks as I proposed.

int WaitLatchOrSocketMulti(pgwaitobject *wobjs, int nobjs, long timeout);

This is implemented only for poll, not for select.

A sample usage is seen in secure_read().

pgwaitobject objs[3];

...

InitWaitLatch(objs[0], MyLatch);
InitWaitPostmasterDeath(objs[1]);
InitWaitSocket(objs[2], port->sock, waitfor);

w = WaitLatchOrSocketMulti(objs, 3, 0);
// w = WaitLatchOrSocket(MyLatch,
// WL_LATCH_SET | WL_POSTMASTER_DEATH | waitfor,
// port->sock, 0);

The core of the function looks like the following. It runs
callbacks for every fired event.

rc = poll(pfds, nfds, (int) cur_timeout);

...

if (rc < 0)

...

else
{
for (i = 0 ; i < nfds ; i++)
{
wobjs[i].retEvents = 0;
if (pfds[i].revents && wobjs[i].cb)
result |= wobjs[i].cb(&wobjs[i], pfds[i].revents);

if (result & WL_IMMEDIATELY_BREAK)
break;
}
}

In the above part, poll()'s revents value is passed to the callbacks, so
the callbacks may need a different implementation for select().

To have a custom callback for a socket, the initializer could look like
the following.

InitWaitSocketCB(wait_obj, sock, event, your_callback);

If we want to hold the waiting-object array independently of
specific functions, to achieve asynchronous handling of socket
events, it could be realised by providing a set of wrapper
functions, exactly as Robert said above.

Does this make sense?
Does anyone have any opinions or thoughts?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-PoC-Add-mult-socket-version-of-WaitLatchOrSocket.patch (text/x-patch; charset=us-ascii)
From b7cc9939ea61654fae98c4fe958c8c67df9f3758 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 19 Feb 2016 16:34:49 +0900
Subject: [PATCH] [PoC] Add mult-socket version of WaitLatchOrSocket.

---
 src/backend/libpq/be-secure.c |  12 +-
 src/backend/port/unix_latch.c | 268 ++++++++++++++++++++++++++++++++++++++++++
 src/include/storage/latch.h   |  29 +++++
 3 files changed, 306 insertions(+), 3 deletions(-)

diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c
index ac709d1..d99a983 100644
--- a/src/backend/libpq/be-secure.c
+++ b/src/backend/libpq/be-secure.c
@@ -141,12 +141,18 @@ retry:
 	if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))
 	{
 		int			w;
+		pgwaitobject objs[3];
 
 		Assert(waitfor);
 
-		w = WaitLatchOrSocket(MyLatch,
-							  WL_LATCH_SET | WL_POSTMASTER_DEATH | waitfor,
-							  port->sock, 0);
+		InitWaitLatch(objs[0], MyLatch);
+		InitWaitPostmasterDeath(objs[1]);
+		InitWaitSocket(objs[2], port->sock, waitfor);
+
+		w = WaitLatchOrSocketMulti(objs, 3, 0);
+//		w = WaitLatchOrSocket(MyLatch,
+//							  WL_LATCH_SET | WL_POSTMASTER_DEATH | waitfor,
+//							  port->sock, 0);
 
 		/*
 		 * If the postmaster has died, it's not safe to continue running,
diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
index 2ad609c..dacb869 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -504,6 +504,274 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	return result;
 }
 
+int
+WAITOBJCB_SOCK_DEF(pgwaitobject *wobj, int revents)
+{
+	int r = 0;
+
+	if (revents & POLLIN)
+	{
+		/* data available in socket, or EOF/error
+		 * condition */
+		wobj->retEvents |= WL_SOCKET_READABLE;
+		r |= WL_SOCKET_READABLE;
+	}
+	if (revents & POLLOUT)
+	{
+		/* socket is writable */
+		wobj->retEvents |= WL_SOCKET_WRITEABLE;
+		r |= WL_SOCKET_WRITEABLE;
+	}
+	if (revents & (POLLHUP | POLLERR | POLLNVAL))
+	{
+		/* EOF/error condition */
+		if (wobj->wakeEvents & WL_SOCKET_READABLE)
+		{
+			wobj->retEvents |= WL_SOCKET_READABLE;
+			r |= WL_SOCKET_READABLE;
+		}
+		if (wobj->wakeEvents & WL_SOCKET_WRITEABLE)
+		{
+			wobj->retEvents |= WL_SOCKET_WRITEABLE;
+			r |= WL_SOCKET_WRITEABLE;
+		}
+	}
+
+	return r;
+}
+
+int
+WAITOBJCB_PMDEATH_DEF(pgwaitobject *wobj, int revents)
+{
+	int r = 0;
+
+	/*
+	 * We expect a POLLHUP when the remote end is closed, but because we don't
+	 * expect the pipe to become readable or to have any errors either, treat
+	 * those cases as postmaster death, too.
+	 */
+	if (revents & (POLLHUP | POLLIN | POLLERR | POLLNVAL))
+	{
+		/*
+		 * According to the select(2) man page on Linux, select(2) may
+		 * spuriously return and report a file descriptor as readable, when
+		 * it's not; and presumably so can poll(2).  It's not clear that the
+		 * relevant cases would ever apply to the postmaster pipe, but since
+		 * the consequences of falsely returning WL_POSTMASTER_DEATH could be
+		 * pretty unpleasant, we take the trouble to positively verify EOF
+		 * with PostmasterIsAlive().
+		 */
+		if (!PostmasterIsAlive())
+		{
+			wobj->retEvents |= WL_POSTMASTER_DEATH;
+			r |= WL_POSTMASTER_DEATH;
+		}
+	}
+
+	return r;
+}
+
+int
+WAITOBJCB_LATCH_DEF(pgwaitobject *wobj, int revents)
+{
+	
+	return 0;
+}
+
+/*
+ * Like WaitLatch, but with an extra socket argument for WL_SOCKET_*
+ * conditions.
+ *
+ * When waiting on a socket, EOF and error conditions are reported by
+ * returning the socket as readable/writable or both, depending on
+ * WL_SOCKET_READABLE/WL_SOCKET_WRITEABLE being specified.
+ */
+int
+WaitLatchOrSocketMulti(pgwaitobject *wobjs, int nobjs, long timeout)
+{
+	int			result = 0;
+	int			rc;
+	instr_time	start_time,
+				cur_time;
+	long		cur_timeout;
+	int			wakeEvents = 0;
+	volatile Latch  *waitlatch = NULL;
+	pgwaitobject *latchobj = NULL;
+	int 		i;
+
+	struct pollfd *pfds;
+	int			nfds;
+
+	/* If timeout > 0, WL_TIMEOUT is implicated */
+	wakeEvents = (timeout > 0 ? WL_TIMEOUT : 0);
+
+	for (i = 0 ; i < nobjs ; i++)
+	{
+		switch (wobjs[i].type)
+		{
+		case WAITOBJ_LATCH:
+			Assert(!waitlatch);
+			latchobj = &wobjs[i];
+			waitlatch = latchobj->latch;
+			if (waitlatch->owner_pid != MyProcPid)
+				elog(ERROR, "cannot wait on a latch owned by another process");
+			wobjs[i].wakeEvents = WL_LATCH_SET;
+			break;
+		case WAITOBJ_POSTMASTER_DEATH:
+			wobjs[i].wakeEvents = WL_POSTMASTER_DEATH;
+			break;
+		case WAITOBJ_SOCK:
+			Assert((wobjs[i].wakeEvents & 
+					~(WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) == 0);
+			break;
+		default:
+			elog(ERROR, "Unknown wait object type: %d", wobjs[i].type);
+		}
+		wakeEvents |= wobjs[i].wakeEvents;
+	}
+	Assert(wakeEvents != 0);	/* must have at least one wake event */
+
+	/*
+	 * Initialize timeout if requested.  We must record the current time so
+	 * that we can determine the remaining timeout if the poll() or select()
+	 * is interrupted.  (On some platforms, select() will update the contents
+	 * of "tv" for us, but unfortunately we can't rely on that.)
+	 */
+	if (wakeEvents & WL_TIMEOUT)
+	{
+		INSTR_TIME_SET_CURRENT(start_time);
+		Assert(timeout >= 0 && timeout <= INT_MAX);
+		cur_timeout = timeout;
+	}
+	else
+	{
+		cur_timeout = -1;
+
+	}
+
+	pfds = palloc(sizeof(struct pollfd) * nobjs);
+
+	for (nfds = 0 ; nfds < nobjs ; nfds++)
+	{
+		switch (wobjs[nfds].type)
+		{
+		case WAITOBJ_SOCK:
+			pfds[nfds].fd = wobjs[nfds].sock;
+			pfds[nfds].events = 0;
+			if (wobjs[nfds].wakeEvents & WL_SOCKET_READABLE)
+				pfds[nfds].events |= POLLIN;
+			if (wobjs[nfds].wakeEvents & WL_SOCKET_WRITEABLE)
+				pfds[nfds].events |= POLLOUT;
+			break;
+
+		case WAITOBJ_LATCH:
+			pfds[nfds].fd = selfpipe_readfd;
+			pfds[nfds].events = POLLIN;
+			break;
+
+		case WAITOBJ_POSTMASTER_DEATH:
+			/* postmaster fd, if used, is always in pfds[nfds - 1] */
+			pfds[nfds].fd = postmaster_alive_fds[POSTMASTER_FD_WATCH];
+			pfds[nfds].events = POLLIN;
+		}
+	}
+
+	waiting = true;
+	do
+	{
+		/*
+		 * Clear the pipe, then check if the latch is set already. If someone
+		 * sets the latch between this and the poll()/select() below, the
+		 * setter will write a byte to the pipe (or signal us and the signal
+		 * handler will do that), and the poll()/select() will return
+		 * immediately.
+		 *
+		 * Note: we assume that the kernel calls involved in drainSelfPipe()
+		 * and SetLatch() will provide adequate synchronization on machines
+		 * with weak memory ordering, so that we cannot miss seeing is_set if
+		 * the signal byte is already in the pipe when we drain it.
+		 */
+		drainSelfPipe();
+
+		if ((wakeEvents & WL_LATCH_SET) && waitlatch->is_set)
+		{
+			/*
+			 * Leave loop immediately, avoid blocking again. We don't attempt
+			 * to report any other events that might also be satisfied.
+			 */
+			result |= latchobj->cb(latchobj, 0);
+
+			/* If callback is provided, follow its order */
+			if (result & WL_IMMEDIATELY_BREAK)
+				break;
+		}
+
+		/*
+		 * Must wait ... we use poll(2) if available, otherwise select(2).
+		 *
+		 * On at least older linux kernels select(), in violation of POSIX,
+		 * doesn't reliably return a socket as writable if closed - but we
+		 * rely on that. So far all the known cases of this problem are on
+		 * platforms that also provide a poll() implementation without that
+		 * bug.  If we find one where that's not the case, we'll need to add a
+		 * workaround.
+		 */
+		for (i = 0 ; i < nfds ; i++)
+			pfds[i].revents = 0;
+
+		/* Sleep */
+		rc = poll(pfds, nfds, (int) cur_timeout);
+
+		/* Check return code */
+		if (rc < 0)
+		{
+			/* EINTR is okay, otherwise complain */
+			if (errno != EINTR)
+			{
+				waiting = false;
+				ereport(ERROR,
+						(errcode_for_socket_access(),
+						 errmsg("poll() failed: %m")));
+			}
+		}
+		else if (rc == 0)
+		{
+			/* timeout exceeded */
+			if (wakeEvents & WL_TIMEOUT)
+				result |= WL_TIMEOUT;
+		}
+		else
+		{
+			for (i = 0 ; i < nfds ; i++)
+			{
+				wobjs[i].retEvents = 0;
+				if (pfds[i].revents && wobjs[i].cb)
+					result |= wobjs[i].cb(&wobjs[i], pfds[i].revents);
+
+				if (result & WL_IMMEDIATELY_BREAK)
+					break;
+			}
+		}
+
+		/* If we're not done, update cur_timeout for next iteration */
+		if (result == 0 && (wakeEvents & WL_TIMEOUT))
+		{
+			INSTR_TIME_SET_CURRENT(cur_time);
+			INSTR_TIME_SUBTRACT(cur_time, start_time);
+			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+			if (cur_timeout <= 0)
+			{
+				/* Timeout has expired, no need to continue looping */
+				result |= WL_TIMEOUT;
+			}
+		}
+	} while (result == 0);
+	waiting = false;
+
+	return result;
+}
+
+
 /*
  * Sets a latch and wakes up anyone waiting on it.
  *
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index e77491e..4fe87ed 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -95,12 +95,37 @@ typedef struct Latch
 #endif
 } Latch;
 
+struct pgwaitobject;
+typedef int (*WaitSockCallback)(struct pgwaitobject *wobjs,	int revents);
+typedef enum pgwaitobjtype
+{
+	WAITOBJ_LATCH,
+	WAITOBJ_POSTMASTER_DEATH,
+	WAITOBJ_SOCK
+} pgwaitobjtype;
+
+typedef struct pgwaitobject
+{
+	pgwaitobjtype type;
+	pgsocket sock;
+	Latch *latch;
+	int wakeEvents;
+	int retEvents;
+	WaitSockCallback cb;
+	void *param;
+} pgwaitobject;
+
+#define InitWaitLatch(o, l) ((o).type=WAITOBJ_LATCH,(o).latch=(l),(o).cb=WAITOBJCB_LATCH_DEF)
+#define InitWaitPostmasterDeath(o) ((o).type=WAITOBJ_POSTMASTER_DEATH,(o).cb=WAITOBJCB_PMDEATH_DEF)
+#define InitWaitSocket(o, s, e) ((o).type=WAITOBJ_SOCK,(o).sock=(s),(o).wakeEvents=(e),(o).cb=WAITOBJCB_SOCK_DEF)
+
 /* Bitmasks for events that may wake-up WaitLatch() clients */
 #define WL_LATCH_SET		 (1 << 0)
 #define WL_SOCKET_READABLE	 (1 << 1)
 #define WL_SOCKET_WRITEABLE  (1 << 2)
 #define WL_TIMEOUT			 (1 << 3)
 #define WL_POSTMASTER_DEATH  (1 << 4)
+#define WL_IMMEDIATELY_BREAK (1 << 5)
 
 /*
  * prototypes for functions in latch.c
@@ -113,6 +138,10 @@ extern void DisownLatch(volatile Latch *latch);
 extern int	WaitLatch(volatile Latch *latch, int wakeEvents, long timeout);
 extern int WaitLatchOrSocket(volatile Latch *latch, int wakeEvents,
 				  pgsocket sock, long timeout);
+extern int WAITOBJCB_LATCH_DEF(pgwaitobject *wobj, int revents);
+extern int WAITOBJCB_SOCK_DEF(pgwaitobject *wobj, int revents);
+extern int WAITOBJCB_PMDEATH_DEF(pgwaitobject *wobj, int revents);
+extern int WaitLatchOrSocketMulti(pgwaitobject *wobjs, int nelem, long timeout);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
-- 
1.8.3.1

#45Andres Freund
andres@anarazel.de
In reply to: Kyotaro HORIGUCHI (#34)
Re: Performance degradation in commit ac1d794

On 2016-02-08 17:49:18 +0900, Kyotaro HORIGUCHI wrote:

How about allowing registration of a callback for every waiting
socket? The signature of the callback function would be like

I don't think a callback-based API is going to serve us well. Most of
the current latch callers would get noticeably more complex that
way. And a number of them will benefit from latches using epoll
internally.

Greetings,

Andres Freund


#46Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#27)
Re: Performance degradation in commit ac1d794

On 2016-01-14 12:07:23 -0500, Robert Haas wrote:

On Thu, Jan 14, 2016 at 12:06 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-01-14 11:31:03 -0500, Robert Haas wrote:

On Thu, Jan 14, 2016 at 10:56 AM, Andres Freund <andres@anarazel.de> wrote:
I think your idea of a data structure that encapsulates a set of events
for which to wait is probably a good one. WaitLatch doesn't seem like
a great name. Maybe WaitEventSet, and then we can have
WaitLatch(&latch) and WaitEvents(&eventset).

Hm, I'd like to have latch in the name. It seems far from improbable to
have another wait data structure. LatchEventSet maybe? The wait would be
implied by WaitLatch.

I can live with that.

How about the following sketch of an API

typedef struct LatchEvent
{
uint32 events; /* interesting events */
uint32 revents; /* returned events */
int fd; /* fd associated with event */
} LatchEvent;

typedef struct LatchEventSet
{
int nevents;
LatchEvent events[FLEXIBLE_ARRAY_MEMBER];
} LatchEventSet;

/*
* Create a LatchEventSet with space for nevents different events to wait for.
*
* latch may be NULL.
*/
extern LatchEventSet *CreateLatchEventSet(int nevents, Latch *latch);

/* ---
* Add an event to the set. Possible events are:
* - WL_LATCH_SET: Wait for the latch to be set
* - WL_SOCKET_READABLE: Wait for socket to become readable
* can be combined in one event with WL_SOCKET_WRITEABLE
* - WL_SOCKET_WRITABLE: Wait for socket to become writable
* can be combined with WL_SOCKET_READABLE
* - WL_POSTMASTER_DEATH: Wait for postmaster to die
*/
extern void AddLatchEventToSet(LatchEventSet *set, uint32 events, int fd);

/*
* Wait for any events added to the set to happen, or until the timeout is
* reached.
*
* The return value is the union of all the events that were detected. This
* makes it possible to avoid looking into the associated events[i].revents
* fields.
*/
extern uint32 WaitLatchEventSet(LatchEventSet *set, long timeout);

I've two questions:
- Is there any benefit of being able to wait for more than one latch?
I'm inclined to not allow that for now, that'd make the patch bigger,
and I don't see a use-case right now.
- Given current users we don't need a large amount of events, so having
to iterate through the registered events doesn't seem bothersome. We
could however change the api to be something like

int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout);

which would return the number of events that happened, and would
basically "fill" one of the (usually stack allocated) OccurredEvent
structures with what happened.

Comments?


#47Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#46)
Re: Performance degradation in commit ac1d794

On Wed, Mar 16, 2016 at 2:52 PM, Andres Freund <andres@anarazel.de> wrote:

How about the following sketch of an API

typedef struct LatchEvent
{
uint32 events; /* interesting events */
uint32 revents; /* returned events */
int fd; /* fd associated with event */
} LatchEvent;

typedef struct LatchEventSet
{
int nevents;
LatchEvent events[FLEXIBLE_ARRAY_MEMBER];
} LatchEventSet;

/*
* Create a LatchEventSet with space for nevents different events to wait for.
*
* latch may be NULL.
*/
extern LatchEventSet *CreateLatchEventSet(int nevents, Latch *latch);

We might be able to rejigger this so that it didn't require palloc, if
we got rid of FLEXIBLE_ARRAY_MEMBER and passed int nevents and
LatchEvent * separately to WaitLatchThingy(). But I guess maybe this
will be infrequent enough not to matter.

I've two questions:
- Is there any benefit of being able to wait for more than one latch?
I'm inclined to not allow that for now, that'd make the patch bigger,
and I don't see a use-case right now.

I don't see a use case, either.

- Given current users we don't need a large amount of events, so having
to iterate through the registered events doesn't seem bothersome. We
could however change the api to be something like

int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout);

which would return the number of events that happened, and would
basically "fill" one of the (usually stack allocated) OccurredEvent
structures with what happened.

I definitely think something along these lines is useful. I want to
be able to have an Append node with 100 ForeignScans under it and kick
off all the scans asynchronously and wait for all of the FDs at once.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#48Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#47)
Re: Performance degradation in commit ac1d794

On 2016-03-16 15:08:07 -0400, Robert Haas wrote:

On Wed, Mar 16, 2016 at 2:52 PM, Andres Freund <andres@anarazel.de> wrote:

How about the following sketch of an API

typedef struct LatchEvent
{
uint32 events; /* interesting events */
uint32 revents; /* returned events */
int fd; /* fd associated with event */
} LatchEvent;

typedef struct LatchEventSet
{
int nevents;
LatchEvent events[FLEXIBLE_ARRAY_MEMBER];
} LatchEventSet;

/*
* Create a LatchEventSet with space for nevents different events to wait for.
*
* latch may be NULL.
*/
extern LatchEventSet *CreateLatchEventSet(int nevents, Latch *latch);

We might be able to rejigger this so that it didn't require palloc, if
we got rid of FLEXIBLE_ARRAY_MEMBER and passed int nevents and
LatchEvent * separately to WaitLatchThingy(). But I guess maybe this
will be infrequent enough not to matter.

I think we'll basically end up allocating them once for the frequent
callsites.

- Given current users we don't need a large amount of events, so having
to iterate through the registered events doesn't seem bothersome. We
could however change the api to be something like

int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout);

which would return the number of events that happened, and would
basically "fill" one of the (usually stack allocated) OccurredEvent
structures with what happened.

I definitely think something along these lines is useful. I want to
be able to have an Append node with 100 ForeignScans under it and kick
off all the scans asynchronously and wait for all of the FDs at once.

So you'd like to get only an event for the FD with data back? Or are you
ok with iterating through a hundred elements in an array, to see which are
ready?

Andres


#49Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#48)
Re: Performance degradation in commit ac1d794

On Wed, Mar 16, 2016 at 3:25 PM, Andres Freund <andres@anarazel.de> wrote:

- Given current users we don't need a large amount of events, so having
to iterate through the registered events doesn't seem bothersome. We
could however change the api to be something like

int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout);

which would return the number of events that happened, and would
basically "fill" one of the (usually stack allocated) OccurredEvent
structures with what happened.

I definitely think something along these lines is useful. I want to
be able to have an Append node with 100 ForeignScans under it and kick
off all the scans asynchronously and wait for all of the FDs at once.

So you'd like to get only an event for the FD with data back? Or are you
ok with iterating through a hundred elements in an array, to see which are
ready?

I'd like to get an event back for the FD with data. Iterating sounds
like it could be really slow. Say you get lots of little packets back
from the same connection, while the others are idle. Now you've got
to keep iterating through them all over and over again. Blech.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#50Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#49)
Re: Performance degradation in commit ac1d794

On 2016-03-16 15:41:28 -0400, Robert Haas wrote:

On Wed, Mar 16, 2016 at 3:25 PM, Andres Freund <andres@anarazel.de> wrote:

- Given current users we don't need a large amount of events, so having
to iterate through the registered events doesn't seem bothersome. We
could however change the api to be something like

int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout);

which would return the number of events that happened, and would
basically "fill" one of the (usually stack allocated) OccurredEvent
structures with what happened.

I definitely think something along these lines is useful. I want to
be able to have an Append node with 100 ForeignScans under it and kick
off all the scans asynchronously and wait for all of the FDs at once.

So you'd like to get only an event for the FD with data back? Or are you
ok with iterating through a hundred elements in an array, to see which are
ready?

I'd like to get an event back for the FD with data. Iterating sounds
like it could be really slow. Say you get lots of little packets back
from the same connection, while the others are idle. Now you've got
to keep iterating through them all over and over again. Blech.

Well, that's what poll() and select() require you to do internally
anyway, even if we abstract it away. But most platforms have better
implementations (epoll, kqueue, ...), so it seems fair to design for
those.

Andres


#51Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#49)
6 attachment(s)
Re: Performance degradation in commit ac1d794

Hi,

On 2016-03-16 15:41:28 -0400, Robert Haas wrote:

On Wed, Mar 16, 2016 at 3:25 PM, Andres Freund <andres@anarazel.de> wrote:

- Given current users we don't need a large amount of events, so having
to iterate through the registered events doesn't seem bothersome. We
could however change the api to be something like

int WaitLatchEventSet(LatchEventSet *set, OccurredEvents *, int nevents, long timeout);

which would return the number of events that happened, and would
basically "fill" one of the (usually stack allocated) OccurredEvent
structures with what happened.

I definitely think something along these lines is useful. I want to
be able to have an Append node with 100 ForeignScans under it and kick
off all the scans asynchronously and wait for all of the FDs at once.

So you'd like to get only an event for the FD with data back? Or are you
ok with iterating through a hundred elements in an array, to see which are
ready?

I'd like to get an event back for the FD with data. Iterating sounds
like it could be really slow. Say you get lots of little packets back
from the same connection, while the others are idle. Now you've got
to keep iterating through them all over and over again. Blech.

I've a (working) WIP version that works like I think you want. It's
implemented in patch 05 (rework with just poll() support) and (add epoll
support). It's based on patches posted here earlier, but these aren't
interesting for the discussion.

The API is now:

typedef struct WaitEventSet WaitEventSet;

typedef struct WaitEvent
{
int pos; /* position in the event data structure */
uint32 events; /* tripped events */
int fd; /* fd associated with event */
} WaitEvent;

/*
* Create a WaitEventSet with space for nevents different events to wait for.
*
* latch may be NULL.
*/
extern WaitEventSet *CreateWaitEventSet(int nevents);

/* ---
* Add an event to the set. Possible events are:
* - WL_LATCH_SET: Wait for the latch to be set
* - WL_POSTMASTER_DEATH: Wait for postmaster to die
* - WL_SOCKET_READABLE: Wait for socket to become readable
* can be combined in one event with WL_SOCKET_WRITEABLE
* - WL_SOCKET_WRITABLE: Wait for socket to become writable
* can be combined with WL_SOCKET_READABLE
*
* Returns the offset in WaitEventSet->events (starting from 0), which can be
* used to modify previously added wait events.
*/
extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, int fd, Latch *latch);

/*
* Change the event mask and, if applicable, the associated latch of a
* WaitEvent.
*/
extern void ModifyWaitEvent(WaitEventSet *set, int pos, uint32 events, Latch *latch);

/*
* Wait for events added to the set to happen, or until the timeout is
* reached. At most nevents occurred events are returned.
*
* Returns the number of events that occurred, or 0 if the timeout was reached.
*/
extern int WaitEventSetWait(WaitEventSet *set, long timeout, WaitEvent* occurred_events, int nevents);

I've for now left the old latch API in place, and only converted
be-secure.c to the new style. I'd appreciate some feedback before I go
around and convert and polish everything.
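
As a rough sketch of how a caller might drive this (assuming the
declarations above plus the existing WL_* flags and MyLatch, so it is only
meaningful together with the patch; the function and variable names, the
timeout value, and passing -1 as the fd for non-socket events are all
illustrative guesses, not part of the proposal):

static int
wait_for_client_socket(pgsocket sock, uint32 waitfor)
{
    WaitEventSet *set = CreateWaitEventSet(3);
    WaitEvent   occurred;
    int         sock_pos;
    int         nevents;

    /* The fd argument is presumably ignored for non-socket events. */
    AddWaitEventToSet(set, WL_LATCH_SET, -1, MyLatch);
    AddWaitEventToSet(set, WL_POSTMASTER_DEATH, -1, NULL);
    sock_pos = AddWaitEventToSet(set, waitfor, sock, NULL);

    /*
     * A long-lived set could later flip the socket between readable and
     * writable interest without rebuilding anything:
     */
    ModifyWaitEvent(set, sock_pos, waitfor, NULL);

    /* Wait up to 10s; zero returned events means the timeout was hit. */
    nevents = WaitEventSetWait(set, 10000, &occurred, 1);

    return (nevents > 0) ? (int) occurred.events : 0;
}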

Questions:
* I'm kinda inclined to merge the win32 and unix latch
implementations. There's already a fair bit in common, and this is
just going to increase that amount.

* Right now the caller has to allocate the WaitEvents he's waiting for
locally (likely on the stack), but we also could allocate them as part
of the WaitEventSet. Not sure if that'd be a benefit.

* I can do a blind rewrite of the windows implementation, but I'm
obviously not going to get that entirely right. So I need some help
from a windows person to test this.

* This approach, with a 'stateful' wait event data structure, will
actually allow us to fix a couple of lingering bugs we have on the windows
port. C.f. /messages/by-id/4351.1336927207@sss.pgh.pa.us

- Andres

Attachments:

0001-Access-the-correct-pollfd-when-checking-for-socket-e.patch
From a692a0bd6a8af7427d491adfecff10df3953a8ae Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 16 Mar 2016 10:44:04 -0700
Subject: [PATCH 1/6] Access the correct pollfd when checking for socket errors
 in the latch code.

Previously, if waiting for a latch, but not a socket, we checked
pfds[0].revents for socket errors. Even though pfds[0] wasn't actually
associated with the socket in that case.

This is currently harmless, because we check wakeEvents after the the
aforementioned check. But it's a bug waiting to be happening.
---
 src/backend/port/unix_latch.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
index 2ad609c..2ffce60 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -365,13 +365,17 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 				/* socket is writable */
 				result |= WL_SOCKET_WRITEABLE;
 			}
-			if (pfds[0].revents & (POLLHUP | POLLERR | POLLNVAL))
+			if ((wakeEvents & WL_SOCKET_READABLE) &&
+				(pfds[0].revents & (POLLHUP | POLLERR | POLLNVAL)))
 			{
-				/* EOF/error condition */
-				if (wakeEvents & WL_SOCKET_READABLE)
-					result |= WL_SOCKET_READABLE;
-				if (wakeEvents & WL_SOCKET_WRITEABLE)
-					result |= WL_SOCKET_WRITEABLE;
+				/* EOF/error condition, while waiting for readable socket */
+				result |= WL_SOCKET_READABLE;
+			}
+			if ((wakeEvents & WL_SOCKET_WRITEABLE) &&
+				(pfds[0].revents & (POLLHUP | POLLERR | POLLNVAL)))
+			{
+				/* EOF/error condition, while waiting for writeable socket */
+				result |= WL_SOCKET_WRITEABLE;
 			}
 
 			/*
-- 
2.7.0.229.g701fa7f

0002-Make-it-easier-to-choose-the-used-waiting-primitive-.patch
From d56555391a67fe9b4f4808b6b9f97a35c3682460 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 14 Jan 2016 14:17:43 +0100
Subject: [PATCH 2/6] Make it easier to choose the used waiting primitive in
 unix_latch.c.

This allows for easier testing of the different primitives; in
preparation for adding a new primitive.

Discussion: 20160114143931.GG10941@awork2.anarazel.de
Reviewed-By: Robert Haas
---
 src/backend/port/unix_latch.c | 50 +++++++++++++++++++++++++++++--------------
 1 file changed, 34 insertions(+), 16 deletions(-)

diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
index 2ffce60..93fbc9e 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -56,6 +56,22 @@
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
 
+/*
+ * Select the fd readiness primitive to use. Normally the "most modern"
+ * primitive supported by the OS will be used, but for testing it can be
+ * useful to manually specify the used primitive.  If desired, just add a
+ * define somewhere before this block.
+ */
+#if defined(LATCH_USE_POLL) || defined(LATCH_USE_SELECT)
+/* don't overwrite manual choice */
+#elif defined(HAVE_POLL)
+#define LATCH_USE_POLL
+#elif HAVE_SYS_SELECT_H
+#define LATCH_USE_SELECT
+#else
+#error "no latch implementation available"
+#endif
+
 /* Are we currently in WaitLatch? The signal handler would like to know. */
 static volatile sig_atomic_t waiting = false;
 
@@ -215,10 +231,10 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 				cur_time;
 	long		cur_timeout;
 
-#ifdef HAVE_POLL
+#if defined(LATCH_USE_POLL)
 	struct pollfd pfds[3];
 	int			nfds;
-#else
+#elif defined(LATCH_USE_SELECT)
 	struct timeval tv,
 			   *tvp;
 	fd_set		input_mask;
@@ -247,7 +263,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 		Assert(timeout >= 0 && timeout <= INT_MAX);
 		cur_timeout = timeout;
 
-#ifndef HAVE_POLL
+#ifdef LATCH_USE_SELECT
 		tv.tv_sec = cur_timeout / 1000L;
 		tv.tv_usec = (cur_timeout % 1000L) * 1000L;
 		tvp = &tv;
@@ -257,7 +273,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	{
 		cur_timeout = -1;
 
-#ifndef HAVE_POLL
+#ifdef LATCH_USE_SELECT
 		tvp = NULL;
 #endif
 	}
@@ -291,16 +307,10 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 		}
 
 		/*
-		 * Must wait ... we use poll(2) if available, otherwise select(2).
-		 *
-		 * On at least older linux kernels select(), in violation of POSIX,
-		 * doesn't reliably return a socket as writable if closed - but we
-		 * rely on that. So far all the known cases of this problem are on
-		 * platforms that also provide a poll() implementation without that
-		 * bug.  If we find one where that's not the case, we'll need to add a
-		 * workaround.
+		 * Must wait ... we use the polling interface determined at the top of
+		 * this file to do so.
 		 */
-#ifdef HAVE_POLL
+#if defined(LATCH_USE_POLL)
 		nfds = 0;
 		if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
 		{
@@ -400,8 +410,16 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 					result |= WL_POSTMASTER_DEATH;
 			}
 		}
-#else							/* !HAVE_POLL */
+#elif defined(LATCH_USE_SELECT)
 
+		/*
+		 * On at least older linux kernels select(), in violation of POSIX,
+		 * doesn't reliably return a socket as writable if closed - but we
+		 * rely on that. So far all the known cases of this problem are on
+		 * platforms that also provide a poll() implementation without that
+		 * bug.  If we find one where that's not the case, we'll need to add a
+		 * workaround.
+		 */
 		FD_ZERO(&input_mask);
 		FD_ZERO(&output_mask);
 
@@ -481,7 +499,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 					result |= WL_POSTMASTER_DEATH;
 			}
 		}
-#endif   /* HAVE_POLL */
+#endif   /* LATCH_USE_SELECT */
 
 		/* If we're not done, update cur_timeout for next iteration */
 		if (result == 0 && (wakeEvents & WL_TIMEOUT))
@@ -494,7 +512,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 				/* Timeout has expired, no need to continue looping */
 				result |= WL_TIMEOUT;
 			}
-#ifndef HAVE_POLL
+#ifdef LATCH_USE_SELECT
 			else
 			{
 				tv.tv_sec = cur_timeout / 1000L;
-- 
2.7.0.229.g701fa7f

0003-Error-out-if-waiting-on-socket-readiness-without-a-s.patch
From 174dc618b8b9be46c6c12f247d7129dbfb1300f4 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 14 Jan 2016 14:24:09 +0100
Subject: [PATCH 3/6] Error out if waiting on socket readiness without a
 specified socket.

Previously we just ignored such an attempt, but that seems to serve no
purpose but making things harder to debug.

Discussion: 20160114143931.GG10941@awork2.anarazel.de
    20151230173734.hx7jj2fnwyljfqek@alap3.anarazel.de
Reviewed-By: Robert Haas
---
 src/backend/port/unix_latch.c  | 7 ++++---
 src/backend/port/win32_latch.c | 6 ++++--
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
index 93fbc9e..b06798f 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -242,9 +242,10 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			hifd;
 #endif
 
-	/* Ignore WL_SOCKET_* events if no valid socket is given */
-	if (sock == PGINVALID_SOCKET)
-		wakeEvents &= ~(WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);
+	/* waiting for socket readiness without a socket indicates a bug */
+	if (sock == PGINVALID_SOCKET &&
+		(wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+		elog(ERROR, "cannot wait on socket events without a socket");
 
 	Assert(wakeEvents != 0);	/* must have at least one wake event */
 
diff --git a/src/backend/port/win32_latch.c b/src/backend/port/win32_latch.c
index 80adc13..e101acf 100644
--- a/src/backend/port/win32_latch.c
+++ b/src/backend/port/win32_latch.c
@@ -119,8 +119,10 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 
 	Assert(wakeEvents != 0);	/* must have at least one wake event */
 
-	if ((wakeEvents & WL_LATCH_SET) && latch->owner_pid != MyProcPid)
-		elog(ERROR, "cannot wait on a latch owned by another process");
+	/* waiting for socket readiness without a socket indicates a bug */
+	if (sock == PGINVALID_SOCKET &&
+		(wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+		elog(ERROR, "cannot wait on socket events without a socket");
 
 	/*
 	 * Initialize timeout if requested.  We must record the current time so
-- 
2.7.0.229.g701fa7f

0004-Only-clear-unix_latch.c-s-self-pipe-if-it-actually-c.patch
From a8c526070791c6db27983aed6e0f1f9b9ed2554c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 14 Jan 2016 15:15:17 +0100
Subject: [PATCH 4/6] Only clear unix_latch.c's self-pipe if it actually
 contains data.

This avoids a good number of, individually quite fast, system calls in
scenarios with many quick queries. Besides the aesthetic benefit of
seing fewer superflous system calls with strace, it also improves
performance by ~2% measured by pgbench -M prepared -c 96 -j 8 -S (scale
100).
---
 src/backend/port/unix_latch.c | 79 ++++++++++++++++++++++++++++---------------
 1 file changed, 52 insertions(+), 27 deletions(-)

diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
index b06798f..f6cb15b 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -283,27 +283,29 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	do
 	{
 		/*
-		 * Clear the pipe, then check if the latch is set already. If someone
-		 * sets the latch between this and the poll()/select() below, the
-		 * setter will write a byte to the pipe (or signal us and the signal
-		 * handler will do that), and the poll()/select() will return
-		 * immediately.
+		 * Check if the latch is set already. If so, leave loop immediately,
+		 * avoid blocking again. We don't attempt to report any other events
+		 * that might also be satisfied.
+		 *
+		 * If someone sets the latch between this and the poll()/select()
+		 * below, the setter will write a byte to the pipe (or signal us and
+		 * the signal handler will do that), and the poll()/select() will
+		 * return immediately.
+		 *
+		 * If there's a pending byte in the self pipe, we'll notice whenever
+		 * blocking. Only clearing the pipe in that case avoids having to
+		 * drain it everytime WaitLatchOrSocket() is used. Should the
+		 * pipe-buffer fill up in some scenarios - widly unlikely - we're
+		 * still ok, because the pipe is in nonblocking mode.
 		 *
 		 * Note: we assume that the kernel calls involved in drainSelfPipe()
 		 * and SetLatch() will provide adequate synchronization on machines
 		 * with weak memory ordering, so that we cannot miss seeing is_set if
 		 * the signal byte is already in the pipe when we drain it.
 		 */
-		drainSelfPipe();
-
 		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
 		{
 			result |= WL_LATCH_SET;
-
-			/*
-			 * Leave loop immediately, avoid blocking again. We don't attempt
-			 * to report any other events that might also be satisfied.
-			 */
 			break;
 		}
 
@@ -313,24 +315,26 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 		 */
 #if defined(LATCH_USE_POLL)
 		nfds = 0;
+
+		/* selfpipe is always in pfds[0] */
+		pfds[0].fd = selfpipe_readfd;
+		pfds[0].events = POLLIN;
+		pfds[0].revents = 0;
+		nfds++;
+
 		if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
 		{
-			/* socket, if used, is always in pfds[0] */
-			pfds[0].fd = sock;
-			pfds[0].events = 0;
+			/* socket, if used, is always in pfds[1] */
+			pfds[1].fd = sock;
+			pfds[1].events = 0;
 			if (wakeEvents & WL_SOCKET_READABLE)
-				pfds[0].events |= POLLIN;
+				pfds[1].events |= POLLIN;
 			if (wakeEvents & WL_SOCKET_WRITEABLE)
-				pfds[0].events |= POLLOUT;
-			pfds[0].revents = 0;
+				pfds[1].events |= POLLOUT;
+			pfds[1].revents = 0;
 			nfds++;
 		}
 
-		pfds[nfds].fd = selfpipe_readfd;
-		pfds[nfds].events = POLLIN;
-		pfds[nfds].revents = 0;
-		nfds++;
-
 		if (wakeEvents & WL_POSTMASTER_DEATH)
 		{
 			/* postmaster fd, if used, is always in pfds[nfds - 1] */
@@ -364,26 +368,33 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 		else
 		{
 			/* at least one event occurred, so check revents values */
+
+			if (pfds[0].revents & POLLIN)
+			{
+				/* There's data in the self-pipe, clear it. */
+				drainSelfPipe();
+			}
+
 			if ((wakeEvents & WL_SOCKET_READABLE) &&
-				(pfds[0].revents & POLLIN))
+				(pfds[1].revents & POLLIN))
 			{
 				/* data available in socket, or EOF/error condition */
 				result |= WL_SOCKET_READABLE;
 			}
 			if ((wakeEvents & WL_SOCKET_WRITEABLE) &&
-				(pfds[0].revents & POLLOUT))
+				(pfds[1].revents & POLLOUT))
 			{
 				/* socket is writable */
 				result |= WL_SOCKET_WRITEABLE;
 			}
 			if ((wakeEvents & WL_SOCKET_READABLE) &&
-				(pfds[0].revents & (POLLHUP | POLLERR | POLLNVAL)))
+				(pfds[1].revents & (POLLHUP | POLLERR | POLLNVAL)))
 			{
 				/* EOF/error condition, while waiting for readable socket */
 				result |= WL_SOCKET_READABLE;
 			}
 			if ((wakeEvents & WL_SOCKET_WRITEABLE) &&
-				(pfds[0].revents & (POLLHUP | POLLERR | POLLNVAL)))
+				(pfds[1].revents & (POLLHUP | POLLERR | POLLNVAL)))
 			{
 				/* EOF/error condition, while waiting for writeable socket */
 				result |= WL_SOCKET_WRITEABLE;
@@ -472,6 +483,11 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 		else
 		{
 			/* at least one event occurred, so check masks */
+			if (FD_ISSET(selfpipe_readfd, &input_mask))
+			{
+				/* There's data in the self-pipe, clear it. */
+				drainSelfPipe();
+			}
 			if ((wakeEvents & WL_SOCKET_READABLE) && FD_ISSET(sock, &input_mask))
 			{
 				/* data available in socket, or EOF */
@@ -502,6 +518,15 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 		}
 #endif   /* LATCH_USE_SELECT */
 
+		/*
+		 * Check again wether latch is set, the arrival of a signal/self-byte
+		 * might be what stopped our sleep. It's not required for correctness
+		 * to signal the latch as being set (we'd just loop if there's no
+		 * other event), but it seems good to report an arrived latch asap.
+		 */
+		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
+			result |= WL_LATCH_SET;
+
 		/* If we're not done, update cur_timeout for next iteration */
 		if (result == 0 && (wakeEvents & WL_TIMEOUT))
 		{
-- 
2.7.0.229.g701fa7f

0005-WIP-WaitEvent-API.patch
From bdb1101bc5b2fd0dd3efa6484f08fa4a1bb93b0c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 16 Mar 2016 17:28:05 -0700
Subject: [PATCH 5/6] WIP: WaitEvent API

---
 src/backend/libpq/be-secure.c     |  24 +--
 src/backend/libpq/pqcomm.c        |  12 ++
 src/backend/port/unix_latch.c     | 317 ++++++++++++++++++++++++++++++++++++++
 src/backend/utils/init/miscinit.c |   8 +
 src/include/libpq/libpq.h         |   3 +
 src/include/storage/latch.h       |  44 ++++++
 6 files changed, 396 insertions(+), 12 deletions(-)

diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c
index ac709d1..c396811 100644
--- a/src/backend/libpq/be-secure.c
+++ b/src/backend/libpq/be-secure.c
@@ -140,13 +140,13 @@ retry:
 	/* In blocking mode, wait until the socket is ready */
 	if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))
 	{
-		int			w;
+		WaitEvent   event;
 
 		Assert(waitfor);
 
-		w = WaitLatchOrSocket(MyLatch,
-							  WL_LATCH_SET | WL_POSTMASTER_DEATH | waitfor,
-							  port->sock, 0);
+		ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);
+
+		WaitEventSetWait(FeBeWaitSet, 0 /* no timeout */, &event, 1);
 
 		/*
 		 * If the postmaster has died, it's not safe to continue running,
@@ -165,13 +165,13 @@ retry:
 		 * cycles checking for this very rare condition, and this should cause
 		 * us to exit quickly in most cases.)
 		 */
-		if (w & WL_POSTMASTER_DEATH)
+		if (event.events & WL_POSTMASTER_DEATH)
 			ereport(FATAL,
 					(errcode(ERRCODE_ADMIN_SHUTDOWN),
 					errmsg("terminating connection due to unexpected postmaster exit")));
 
 		/* Handle interrupt. */
-		if (w & WL_LATCH_SET)
+		if (event.events & WL_LATCH_SET)
 		{
 			ResetLatch(MyLatch);
 			ProcessClientReadInterrupt(true);
@@ -241,22 +241,22 @@ retry:
 
 	if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))
 	{
-		int			w;
+		WaitEvent   event;
 
 		Assert(waitfor);
 
-		w = WaitLatchOrSocket(MyLatch,
-							  WL_LATCH_SET | WL_POSTMASTER_DEATH | waitfor,
-							  port->sock, 0);
+		ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);
+
+		WaitEventSetWait(FeBeWaitSet, 0 /* no timeout */, &event, 1);
 
 		/* See comments in secure_read. */
-		if (w & WL_POSTMASTER_DEATH)
+		if (event.events & WL_POSTMASTER_DEATH)
 			ereport(FATAL,
 					(errcode(ERRCODE_ADMIN_SHUTDOWN),
 					errmsg("terminating connection due to unexpected postmaster exit")));
 
 		/* Handle interrupt. */
-		if (w & WL_LATCH_SET)
+		if (event.events & WL_LATCH_SET)
 		{
 			ResetLatch(MyLatch);
 			ProcessClientWriteInterrupt(true);
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 71473db..31d646d 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,6 +201,18 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
+	{
+		MemoryContext oldcontext;
+
+		oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+		FeBeWaitSet = CreateWaitEventSet(3);
+		AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock, NULL);
+		AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch);
+		AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL);
+
+		MemoryContextSwitchTo(oldcontext);
+	}
 }
 
 /* --------------------------------
diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
index f6cb15b..9bcfe14 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -72,6 +72,28 @@
 #error "no latch implementation available"
 #endif
 
+#if defined(WAIT_USE_POLL) || defined(WAIT_USE_SELECT)
+/* don't overwrite manual choice */
+#elif defined(HAVE_POLL)
+#define WAIT_USE_POLL
+#elif HAVE_SYS_SELECT_H
+#define WAIT_USE_SELECT
+#else
+#error "no wait set implementation available"
+#endif
+
+typedef struct WaitEventSet
+{
+	int nevents;
+	int nevents_space;
+	Latch *latch;
+	int latch_pos;
+	WaitEvent *events;
+#if defined(WAIT_USE_POLL)
+	struct pollfd *pollfds;
+#endif
+} WaitEventSet;
+
 /* Are we currently in WaitLatch? The signal handler would like to know. */
 static volatile sig_atomic_t waiting = false;
 
@@ -83,6 +105,9 @@ static int	selfpipe_writefd = -1;
 static void sendSelfPipeByte(void);
 static void drainSelfPipe(void);
 
+#if defined(WAIT_USE_POLL)
+static void WaitEventAdjustPoll(WaitEventSet *set, WaitEvent *event);
+#endif
 
 /*
  * Initialize the process-local latch infrastructure.
@@ -636,6 +661,298 @@ ResetLatch(volatile Latch *latch)
 	pg_memory_barrier();
 }
 
+WaitEventSet *
+CreateWaitEventSet(int nevents)
+{
+	WaitEventSet   *set;
+	char		   *data;
+	Size			sz = 0;
+
+	sz += sizeof(WaitEventSet);
+	sz += sizeof(WaitEvent) * nevents;
+
+#if defined(LATCH_USE_POLL)
+	sz += sizeof(struct pollfd) * nevents;
+#endif
+
+	data = (char *) palloc0(sz);
+
+	set = (WaitEventSet *) data;
+	data += sizeof(WaitEventSet);
+
+	set->events = (WaitEvent *) data;
+	data += sizeof(WaitEvent) * nevents;
+
+#if defined(WAIT_USE_POLL)
+	set->pollfds = (struct pollfd *) data;
+	data += sizeof(struct pollfd) * nevents;
+#endif
+
+	set->latch = NULL;
+	set->nevents_space = nevents;
+
+	return set;
+}
+
+int
+AddWaitEventToSet(WaitEventSet *set, uint32 events, int fd, Latch *latch)
+{
+	WaitEvent *event;
+
+	if (set->nevents_space <= set->nevents)
+		elog(ERROR, "no space for yet another event");
+
+	if (set->latch && latch)
+		elog(ERROR, "can only wait for one latch");
+	if (!latch && (events & WL_LATCH_SET))
+		elog(ERROR, "cannot wait on latch without latch");
+
+	/* FIXME: validate event mask */
+
+	event = &set->events[set->nevents];
+	event->pos = set->nevents++;
+	event->fd = fd;
+	event->events = events;
+
+	if (events == WL_LATCH_SET)
+	{
+		set->latch = latch;
+		set->latch_pos = event->pos;
+		event->fd = selfpipe_readfd;
+	}
+	else if (events == WL_POSTMASTER_DEATH)
+	{
+		event->fd = postmaster_alive_fds[POSTMASTER_FD_WATCH];
+	}
+
+#if defined(WAIT_USE_POLL)
+	WaitEventAdjustPoll(set, event);
+#endif
+
+	return event->pos;
+}
+
+void
+ModifyWaitEvent(WaitEventSet *set, int pos, uint32 events, Latch *latch)
+{
+	WaitEvent *event;
+
+	Assert(pos < set->nevents);
+
+	event = &set->events[pos];
+
+	/* no need to perform any checks/modifications */
+	if (events == event->events && !(event->events & WL_LATCH_SET))
+		return;
+
+	if (event->events & WL_LATCH_SET &&
+		events != event->events)
+	{
+		/* we could allow to disable latch events for a while */
+		elog(ERROR, "cannot modify latch event");
+	}
+	if (event->events & WL_POSTMASTER_DEATH &&
+		events != event->events)
+	{
+		elog(ERROR, "cannot modify postmaster death event");
+	}
+
+	/* FIXME: validate event mask */
+	event->events = events;
+
+	if (events == WL_LATCH_SET)
+	{
+		set->latch = latch;
+	}
+
+#if defined(WAIT_USE_POLL)
+	WaitEventAdjustPoll(set, event);
+#endif
+}
+
+#if defined(WAIT_USE_POLL)
+static void
+WaitEventAdjustPoll(WaitEventSet *set, WaitEvent *event)
+{
+	struct pollfd *pollfd = &set->pollfds[event->pos];
+
+	pollfd->revents = 0;
+	pollfd->fd = event->fd;
+
+	/* prepare pollfd entry once */
+	if (event->events == WL_LATCH_SET)
+	{
+		Assert(set->latch != NULL);
+		pollfd->events = POLLIN;
+	}
+	else if (event->events == WL_POSTMASTER_DEATH)
+	{
+		pollfd->events = POLLIN;
+	}
+	else
+	{
+		Assert(event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE));
+		pollfd->events = 0;
+		if (event->events & WL_SOCKET_READABLE)
+			pollfd->events |= POLLIN;
+		if (event->events & WL_SOCKET_WRITEABLE)
+			pollfd->events |= POLLOUT;
+	}
+
+	Assert(event->fd >= 0);
+}
+#endif
+
+#if defined(WAIT_USE_POLL)
+int
+WaitEventSetWait(WaitEventSet *set, long timeout,
+				 WaitEvent* occurred_events, int nevents)
+{
+	int returned_events = 0;
+	instr_time	start_time,
+				cur_time;
+	long		cur_timeout = -1;
+	WaitEvent *cur_event;
+
+	struct pollfd *pfds = set->pollfds;
+
+	Assert(nevents > 0);
+
+	if (timeout)
+	{
+		INSTR_TIME_SET_CURRENT(start_time);
+		Assert(timeout >= 0 && timeout <= INT_MAX);
+		cur_timeout = timeout;
+	}
+
+	waiting = true;
+	while (returned_events == 0)
+	{
+		int rc;
+		int pos;
+		struct pollfd *cur_pollfd;
+
+		/* return immediately if latch is set */
+		if (set->latch && set->latch->is_set)
+		{
+			occurred_events->fd = -1;
+			occurred_events->pos = set->latch_pos;
+			occurred_events->events = WL_LATCH_SET;
+			occurred_events++;
+			returned_events++;
+
+			continue;
+		}
+
+		/* Sleep */
+		rc = poll(pfds, set->nevents, (int) cur_timeout);
+
+		/* Check return code */
+		if (rc < 0)
+		{
+			/* EINTR is okay, otherwise complain */
+			if (errno != EINTR)
+			{
+				waiting = false;
+				ereport(ERROR,
+						(errcode_for_socket_access(),
+						 errmsg("poll() failed: %m")));
+			}
+			continue;
+		}
+		else if (rc == 0)
+		{
+			break;
+		}
+
+		for (pos = 0, cur_event = set->events, cur_pollfd = set->pollfds;
+			 pos < set->nevents && returned_events < nevents;
+			 pos++, cur_event++, cur_pollfd++)
+		{
+			if (cur_event->events == WL_LATCH_SET &&
+				(cur_pollfd->revents & (POLLIN | POLLHUP | POLLERR | POLLNVAL)))
+			{
+				/* There's data in the self-pipe, clear it. */
+				drainSelfPipe();
+
+				if (set->latch->is_set)
+				{
+					occurred_events->fd = -1;
+					occurred_events->pos = cur_event->pos;
+					occurred_events->events = WL_LATCH_SET;
+					occurred_events++;
+				}
+			}
+			else if (cur_event->events == WL_POSTMASTER_DEATH &&
+					 (cur_pollfd->revents & (POLLIN | POLLHUP | POLLERR | POLLNVAL)))
+			{
+				/*
+				 * According to the select(2) man page on Linux, select(2) may
+				 * spuriously return and report a file descriptor as readable,
+				 * when it's not; and presumably so can poll(2).  It's not
+				 * clear that the relevant cases would ever apply to the
+				 * postmaster pipe, but since the consequences of falsely
+				 * returning WL_POSTMASTER_DEATH could be pretty unpleasant,
+				 * we take the trouble to positively verify EOF with
+				 * PostmasterIsAlive().
+				 */
+				if (!PostmasterIsAlive())
+				{
+					occurred_events->fd = -1;
+					occurred_events->pos = cur_event->pos;
+					occurred_events->events = WL_POSTMASTER_DEATH;
+					occurred_events++;
+					returned_events++;
+				}
+			}
+			else if (cur_event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+			{
+				Assert(cur_event->fd);
+
+				occurred_events->fd = cur_event->fd;
+				occurred_events->pos = cur_event->pos;
+				occurred_events->events = 0;
+
+				if ((cur_event->events & WL_SOCKET_READABLE) &&
+					(cur_pollfd->revents & (POLLIN | POLLHUP | POLLERR | POLLNVAL)))
+				{
+					occurred_events->events |= WL_SOCKET_READABLE;
+				}
+
+				if ((cur_event->events & WL_SOCKET_WRITEABLE) &&
+					(cur_pollfd->revents & (POLLOUT | POLLHUP | POLLERR | POLLNVAL)))
+				{
+					occurred_events->events |= WL_SOCKET_WRITEABLE;
+					occurred_events++;
+					returned_events++;
+				}
+
+				if (occurred_events->events != 0)
+				{
+					occurred_events++;
+					returned_events++;
+				}
+			}
+		}
+
+
+		if (occurred_events == 0 && timeout != 0)
+		{
+			INSTR_TIME_SET_CURRENT(cur_time);
+			INSTR_TIME_SUBTRACT(cur_time, start_time);
+			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+			if (cur_timeout <= 0)
+				goto out;
+		}
+	}
+out:
+	waiting = false;
+
+	return returned_events;
+}
+#elif defined(WAIT_USE_SELECT)
+#endif
+
 /*
  * SetLatch uses SIGUSR1 to wake up the process waiting on the latch.
  *
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 18f5e6f..d13355b 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -33,6 +33,7 @@
 
 #include "access/htup_details.h"
 #include "catalog/pg_authid.h"
+#include "libpq/libpq.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "postmaster/autovacuum.h"
@@ -247,6 +248,9 @@ SwitchToSharedLatch(void)
 
 	MyLatch = &MyProc->procLatch;
 
+	if (FeBeWaitSet)
+		ModifyWaitEvent(FeBeWaitSet, 1, WL_LATCH_SET, MyLatch);
+
 	/*
 	 * Set the shared latch as the local one might have been set. This
 	 * shouldn't normally be necessary as code is supposed to check the
@@ -262,6 +266,10 @@ SwitchBackToLocalLatch(void)
 	Assert(MyProc != NULL && MyLatch == &MyProc->procLatch);
 
 	MyLatch = &LocalLatchData;
+
+	if (FeBeWaitSet)
+		ModifyWaitEvent(FeBeWaitSet, 1, WL_LATCH_SET, MyLatch);
+
 	SetLatch(MyLatch);
 }
 
diff --git a/src/include/libpq/libpq.h b/src/include/libpq/libpq.h
index 0569994..109fdf7 100644
--- a/src/include/libpq/libpq.h
+++ b/src/include/libpq/libpq.h
@@ -19,6 +19,7 @@
 
 #include "lib/stringinfo.h"
 #include "libpq/libpq-be.h"
+#include "storage/latch.h"
 
 
 typedef struct
@@ -95,6 +96,8 @@ extern ssize_t secure_raw_write(Port *port, const void *ptr, size_t len);
 
 extern bool ssl_loaded_verify_locations;
 
+WaitEventSet *FeBeWaitSet;
+
 /* GUCs */
 extern char *SSLCipherSuites;
 extern char *SSLECDHCurve;
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index e77491e..941e2f0 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -102,6 +102,50 @@ typedef struct Latch
 #define WL_TIMEOUT			 (1 << 3)
 #define WL_POSTMASTER_DEATH  (1 << 4)
 
+typedef struct WaitEventSet WaitEventSet;
+
+typedef struct WaitEvent
+{
+	int		pos;		/* position in the event data structure */
+	uint32	events;		/* tripped events */
+	int		fd;			/* fd associated with event */
+} WaitEvent;
+
+/*
+ * Create a WaitEventSet with space for nevents different events to wait for.
+ *
+ * latch may be NULL.
+ */
+extern WaitEventSet *CreateWaitEventSet(int nevents);
+
+/* ---
+ * Add an event to the set. Possible events are:
+ * - WL_LATCH_SET: Wait for the latch to be set
+ * - WL_POSTMASTER_DEATH: Wait for postmaster to die
+ * - WL_SOCKET_READABLE: Wait for socket to become readable
+ *   can be combined in one event with WL_SOCKET_WRITEABLE
+ * - WL_SOCKET_WRITABLE: Wait for socket to become readable
+ *   can be combined with WL_SOCKET_READABLE
+ *
+ * Returns the offset in WaitEventSet->events (starting from 0), which can be
+ * used to modify previously added wait events.
+ */
+extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, int fd, Latch *latch);
+
+/*
+ * Change the event mask and, if applicable, the associated latch of of a
+ * WaitEvent.
+ */
+extern void ModifyWaitEvent(WaitEventSet *set, int pos, uint32 events, Latch *latch);
+
+/*
+ * Wait for events added to the set to happen, or until the timeout is
+ * reached.  At most nevents occurrent events are returned.
+ *
+ * Returns the number of events occurred, or 0 if the timeout was reached.
+ */
+extern int WaitEventSetWait(WaitEventSet *set, long timeout, WaitEvent* occurred_events, int nevents);
+
 /*
  * prototypes for functions in latch.c
  */
-- 
2.7.0.229.g701fa7f

0006-WIP-Use-epoll-for-Wait-Event-API-if-available.patch
From 3feafa5ecd6666aacbaf3ceda466044476dda63a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 16 Mar 2016 18:14:18 -0700
Subject: [PATCH 6/6] WIP: Use epoll for Wait Event API if available.

---
 configure                     |   2 +-
 configure.in                  |   2 +-
 src/backend/port/unix_latch.c | 238 ++++++++++++++++++++++++++++++++++++++++--
 src/include/pg_config.h.in    |   3 +
 4 files changed, 235 insertions(+), 10 deletions(-)

diff --git a/configure b/configure
index a45be67..da897ae 100755
--- a/configure
+++ b/configure
@@ -10193,7 +10193,7 @@ fi
 ## Header files
 ##
 
-for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
+for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/epoll.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
 do :
   as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh`
 ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default"
diff --git a/configure.in b/configure.in
index c298926..dee3c45 100644
--- a/configure.in
+++ b/configure.in
@@ -1183,7 +1183,7 @@ AC_SUBST(UUID_LIBS)
 ##
 
 dnl sys/socket.h is required by AC_FUNC_ACCEPT_ARGTYPES
-AC_CHECK_HEADERS([atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h])
+AC_CHECK_HEADERS([atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/epoll.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h])
 
 # On BSD, test for net/if.h will fail unless sys/socket.h
 # is included first.
diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
index 9bcfe14..c233d68 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -38,6 +38,9 @@
 #include <unistd.h>
 #include <sys/time.h>
 #include <sys/types.h>
+#ifdef HAVE_SYS_EPOLL_H
+#include <sys/epoll.h>
+#endif
 #ifdef HAVE_POLL_H
 #include <poll.h>
 #endif
@@ -72,8 +75,10 @@
 #error "no latch implementation available"
 #endif
 
-#if defined(WAIT_USE_POLL) || defined(WAIT_USE_SELECT)
+#if defined(WAIT_USE_EPOLL) || defined(WAIT_USE_POLL) || defined(WAIT_USE_SELECT)
 /* don't overwrite manual choice */
+#elif defined(HAVE_SYS_EPOLL_H)
+#define WAIT_USE_EPOLL
 #elif defined(HAVE_POLL)
 #define WAIT_USE_POLL
 #elif HAVE_SYS_SELECT_H
@@ -89,7 +94,10 @@ typedef struct WaitEventSet
 	Latch *latch;
 	int latch_pos;
 	WaitEvent *events;
-#if defined(WAIT_USE_POLL)
+#if defined(WAIT_USE_EPOLL)
+	struct epoll_event *epoll_ret_events;
+	int epoll_fd;
+#elif defined(WAIT_USE_POLL)
 	struct pollfd *pollfds;
 #endif
 } WaitEventSet;
@@ -105,7 +113,9 @@ static int	selfpipe_writefd = -1;
 static void sendSelfPipeByte(void);
 static void drainSelfPipe(void);
 
-#if defined(WAIT_USE_POLL)
+#if defined(WAIT_USE_EPOLL)
+static void WaitEventAdjustEpoll(WaitEventSet *set, WaitEvent *event, int action);
+#elif defined(WAIT_USE_POLL)
 static void WaitEventAdjustPoll(WaitEventSet *set, WaitEvent *event);
 #endif
 
@@ -671,7 +681,9 @@ CreateWaitEventSet(int nevents)
 	sz += sizeof(WaitEventSet);
 	sz += sizeof(WaitEvent) * nevents;
 
-#if defined(LATCH_USE_POLL)
+#if defined(WAIT_USE_EPOLL)
+	sz += sizeof(struct epoll_event) * nevents;
+#elif defined(WAIT_USE_POLL)
 	sz += sizeof(struct pollfd) * nevents;
 #endif
 
@@ -683,7 +695,10 @@ CreateWaitEventSet(int nevents)
 	set->events = (WaitEvent *) data;
 	data += sizeof(WaitEvent) * nevents;
 
-#if defined(WAIT_USE_POLL)
+#if defined(WAIT_USE_EPOLL)
+	set->epoll_ret_events = (struct epoll_event *) data;
+	data += sizeof(struct epoll_event) * nevents;
+#elif defined(WAIT_USE_POLL)
 	set->pollfds = (struct pollfd *) data;
 	data += sizeof(struct pollfd) * nevents;
 #endif
@@ -691,6 +706,12 @@ CreateWaitEventSet(int nevents)
 	set->latch = NULL;
 	set->nevents_space = nevents;
 
+#if defined(WAIT_USE_EPOLL)
+	set->epoll_fd = epoll_create(nevents);
+	if (set->epoll_fd < 0)
+		elog(ERROR, "epoll_create failed: %m");
+#endif
+
 	return set;
 }
 
@@ -725,7 +746,9 @@ AddWaitEventToSet(WaitEventSet *set, uint32 events, int fd, Latch *latch)
 		event->fd = postmaster_alive_fds[POSTMASTER_FD_WATCH];
 	}
 
-#if defined(WAIT_USE_POLL)
+#if defined(WAIT_USE_EPOLL)
+	WaitEventAdjustEpoll(set, event, EPOLL_CTL_ADD);
+#elif defined(WAIT_USE_POLL)
 	WaitEventAdjustPoll(set, event);
 #endif
 
@@ -765,11 +788,59 @@ ModifyWaitEvent(WaitEventSet *set, int pos, uint32 events, Latch *latch)
 		set->latch = latch;
 	}
 
-#if defined(WAIT_USE_POLL)
+#if defined(WAIT_USE_EPOLL)
+	WaitEventAdjustEpoll(set, event, EPOLL_CTL_MOD);
+#elif defined(WAIT_USE_POLL)
 	WaitEventAdjustPoll(set, event);
 #endif
 }
 
+#if defined(WAIT_USE_EPOLL)
+static void
+WaitEventAdjustEpoll(WaitEventSet *set, WaitEvent *event, int action)
+{
+	struct epoll_event epoll_ev;
+	int rc;
+
+	epoll_ev.events = EPOLLERR | EPOLLHUP;
+	/* pointer to our event, returned by epoll_wait */
+	epoll_ev.data.ptr = event;
+
+	/* prepare pollfd entry once */
+	if (event->events == WL_LATCH_SET)
+	{
+		Assert(set->latch != NULL);
+		epoll_ev.events |= EPOLLIN;
+	}
+	else if (event->events == WL_POSTMASTER_DEATH)
+	{
+		epoll_ev.events |= EPOLLIN;
+	}
+	else
+	{
+		Assert(event->fd >= 0);
+		Assert(event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE));
+
+		if (event->events & WL_SOCKET_READABLE)
+			epoll_ev.events |= EPOLLIN;
+		if (event->events & WL_SOCKET_WRITEABLE)
+			epoll_ev.events |= EPOLLOUT;
+	}
+
+	/*
+	 * Even though unused, we also poss epoll_ev as the data argument for
+	 * EPOLL_CTL_DELETE.  There used to be an epoll bug requiring that, and it
+	 * makes the code simpler...
+	 */
+	rc = epoll_ctl(set->epoll_fd, action, event->fd, &epoll_ev);
+
+	if (rc < 0)
+		ereport(ERROR,
+				(errcode_for_socket_access(),
+				 errmsg("epoll_ctl() failed: %m")));
+}
+#endif
+
 #if defined(WAIT_USE_POLL)
 static void
 WaitEventAdjustPoll(WaitEventSet *set, WaitEvent *event)
@@ -803,7 +874,158 @@ WaitEventAdjustPoll(WaitEventSet *set, WaitEvent *event)
 }
 #endif
 
-#if defined(WAIT_USE_POLL)
+#if defined(WAIT_USE_EPOLL)
+int
+WaitEventSetWait(WaitEventSet *set, long timeout,
+				 WaitEvent* occurred_events, int nevents)
+{
+	int returned_events = 0;
+	instr_time	start_time,
+				cur_time;
+	long		cur_timeout = -1;
+	WaitEvent *cur_event;
+
+	Assert(nevents > 0);
+
+	if (timeout)
+	{
+		INSTR_TIME_SET_CURRENT(start_time);
+		Assert(timeout >= 0 && timeout <= INT_MAX);
+		cur_timeout = timeout;
+	}
+
+	waiting = true;
+	while (returned_events == 0)
+	{
+		int rc;
+		int pos;
+		struct epoll_event *cur_epoll_event;
+
+		/* return immediately if latch is set */
+		if (set->latch && set->latch->is_set)
+		{
+			occurred_events->fd = -1;
+			occurred_events->pos = set->latch_pos;
+			occurred_events->events = WL_LATCH_SET;
+			occurred_events++;
+			returned_events++;
+
+			continue;
+		}
+
+		/* Sleep */
+		rc = epoll_wait(set->epoll_fd, set->epoll_ret_events,
+						nevents, cur_timeout);
+
+		/* Check return code */
+		if (rc < 0)
+		{
+			/* EINTR is okay, otherwise complain */
+			if (errno != EINTR)
+			{
+				waiting = false;
+				ereport(ERROR,
+						(errcode_for_socket_access(),
+						 errmsg("poll() failed: %m")));
+			}
+			continue;
+		}
+		else if (rc == 0)
+		{
+			break;
+		}
+
+		/* iterate over the returned epoll events */
+		for (pos = 0, cur_epoll_event = set->epoll_ret_events;
+			 pos < rc && returned_events < nevents;
+			 pos++, cur_epoll_event++)
+		{
+			cur_event = (WaitEvent *) cur_epoll_event->data.ptr;
+
+			if (cur_event->events == WL_LATCH_SET &&
+				cur_epoll_event->events & (EPOLLIN | EPOLLERR | EPOLLHUP))
+			{
+				/* There's data in the self-pipe, clear it. */
+				drainSelfPipe();
+
+				if (set->latch->is_set)
+				{
+					occurred_events->fd = -1;
+					occurred_events->pos = cur_event->pos;
+					occurred_events->events = WL_LATCH_SET;
+					occurred_events++;
+				}
+			}
+			else if (cur_event->events == WL_POSTMASTER_DEATH &&
+					 cur_epoll_event->events & (EPOLLIN | EPOLLERR | EPOLLHUP))
+			{
+				/*
+				 * FIXME:
+				 * According to the select(2) man page on Linux, select(2) may
+				 * spuriously return and report a file descriptor as readable,
+				 * when it's not; and presumably so can poll(2).  It's not
+				 * clear that the relevant cases would ever apply to the
+				 * postmaster pipe, but since the consequences of falsely
+				 * returning WL_POSTMASTER_DEATH could be pretty unpleasant,
+				 * we take the trouble to positively verify EOF with
+				 * PostmasterIsAlive().
+				 */
+				if (!PostmasterIsAlive())
+				{
+					occurred_events->fd = -1;
+					occurred_events->pos = cur_event->pos;
+					occurred_events->events = WL_POSTMASTER_DEATH;
+					occurred_events++;
+					returned_events++;
+				}
+			}
+			else if (cur_event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+			{
+				Assert(cur_event->fd);
+
+				occurred_events->fd = cur_event->fd;
+				occurred_events->pos = cur_event->pos;
+				occurred_events->events = 0;
+
+				if ((cur_event->events & WL_SOCKET_READABLE) &&
+					(cur_epoll_event->events & (EPOLLIN | EPOLLERR | EPOLLHUP)))
+				{
+					occurred_events->events |= WL_SOCKET_READABLE;
+				}
+
+				if ((cur_event->events & WL_SOCKET_WRITEABLE) &&
+					(cur_epoll_event->events & (EPOLLOUT | EPOLLERR | EPOLLHUP)))
+				{
+					occurred_events->events |= WL_SOCKET_WRITEABLE;
+					occurred_events++;
+					returned_events++;
+				}
+
+				if (occurred_events->events != 0)
+				{
+					occurred_events++;
+					returned_events++;
+				}
+			}
+		}
+
+
+		if (occurred_events == 0 && timeout != 0)
+		{
+			INSTR_TIME_SET_CURRENT(cur_time);
+			INSTR_TIME_SUBTRACT(cur_time, start_time);
+			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+			if (cur_timeout <= 0)
+				goto out;
+		}
+	}
+out:
+	waiting = false;
+
+	return returned_events;
+}
+
+#elif defined(WAIT_USE_POLL)
 int
 WaitEventSetWait(WaitEventSet *set, long timeout,
 				 WaitEvent* occurred_events, int nevents)
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 3813226..c72635c 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -530,6 +530,9 @@
 /* Define to 1 if you have the syslog interface. */
 #undef HAVE_SYSLOG
 
+/* Define to 1 if you have the <sys/epoll.h> header file. */
+#undef HAVE_SYS_EPOLL_H
+
 /* Define to 1 if you have the <sys/ioctl.h> header file. */
 #undef HAVE_SYS_IOCTL_H
 
-- 
2.7.0.229.g701fa7f

#52Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#51)
Re: Performance degradation in commit ac1d794

On Thu, Mar 17, 2016 at 7:34 AM, Andres Freund <andres@anarazel.de> wrote:

Hi,

* I can do a blind rewrite of the windows implementation, but I'm
obviously not going to get that entirely right. So I need some help
from a windows person to test this.

I can help you verifying the windows implementation.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#53Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#51)
Re: Performance degradation in commit ac1d794

On Wed, Mar 16, 2016 at 10:04 PM, Andres Freund <andres@anarazel.de> wrote:

Questions:
* I'm kinda inclined to merge the win32 and unix latch
implementations. There's already a fair bit in common, and this is
just going to increase that amount.

Don't care either way.

* Right now the caller has to allocate the WaitEvents he's waiting for
locally (likely on the stack), but we also could allocate them as part
of the WaitEventSet. Not sure if that'd be a benefit.

I'm not seeing this. What do you mean?

0001: Looking at this again, I'm no longer sure this is a bug.
Doesn't your patch just check the same conditions in the opposite
order?

0002: I think I reviewed this before. Boring. Just commit it already.

0003: Mostly boring. But the change to win32_latch.c seems to remove
an unrelated check.

0004:

+         * drain it everytime WaitLatchOrSocket() is used. Should the
+         * pipe-buffer fill up in some scenarios - widly unlikely - we're

every time
wildly

Why is it wildly (or widly) unlikely?

The rejiggering this does between what is on which element of pfds[]
appears to be unrelated to the ostensible purpose of the patch.

+ * Check again wether latch is set, the arrival of a signal/self-byte

whether. Also not clearly related to the patch's main purpose.

             /* at least one event occurred, so check masks */
+            if (FD_ISSET(selfpipe_readfd, &input_mask))
+            {
+                /* There's data in the self-pipe, clear it. */
+                drainSelfPipe();
+            }

The comment just preceding this added hunk now seems to be out of
place, and maybe inaccurate as well. I think the new code could have
a bit more detailed comment. My understanding is something like /*
Since we didn't clear the self-pipe before attempting to wait,
select() may have returned immediately even though there has been no
recent change to the state of the latch. To prevent busy-looping, we
must clear the pipe before attempting to wait again. */

I'll look at 0005 next, but thought I would send these comments along first.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#54Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#53)
Re: Performance degradation in commit ac1d794

On Thu, Mar 17, 2016 at 9:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I'll look at 0005 next, but thought I would send these comments along first.

0005: This is obviously very much WIP, but I think the overall
direction of it is good.
0006: Same.

I think you should use PGINVALID_SOCKET rather than -1 in various
places in various patches in this series, especially if you are going
to try to merge the Windows code path.

I wonder if CreateEventSet should accept a MemoryContext argument. It
seems like callers will usually want TopMemoryContext, and just being
able to pass that might be more convenient than having to switch back
and forth in the calling code.
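
To make that concrete, CreateWaitEventSet()'s signature could become
something like this (hypothetical sketch, not what the posted patches do):

/* hypothetical: allocate the set and its per-event arrays in 'context' */
extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);

and the pq_init() hunk from 0005 would then shrink to:

FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock, NULL);
AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch);
AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL);

without the MemoryContextSwitchTo() dance.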

I wonder if there's a way to refactor this code to avoid having so
much cut-and-paste duplication.

When iterating over the returned events, maybe check whether events is
0 at the top of the loop and skip it forthwith if so.

That's all I've got for now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#55Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#54)
Re: Performance degradation in commit ac1d794

On 2016-03-17 09:40:08 -0400, Robert Haas wrote:

On Thu, Mar 17, 2016 at 9:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I'll look at 0005 next, but thought I would send these comments along first.

0005: This is obviously very much WIP, but I think the overall
direction of it is good.
0006: Same.

I think you should use PGINVALID_SOCKET rather than -1 in various
places in various patches in this series, especially if you are going
to try to merge the Windows code path.

Sure.

I wonder if CreateEventSet should accept a MemoryContext argument. It
seems like callers will usually want TopMemoryContext, and just being
able to pass that might be more convenient than having to switch back
and forth in the calling code.

Makes sense.

I wonder if there's a way to refactor this code to avoid having so
much cut-and-paste duplication.

I guess you mean WaitEventSetWait() and WaitEventAdjust*? I've tried,
and my attempt ended up looking nearly unreadable, because of the number
of ifdefs. I've not found a good approach. Which is sad, because adding
back select support is going to increase the duplication further :( - but
it's also further away from poll etc. (different type of timestamp,
entirely different way of returning events).

When iterating over the returned events, maybe check whether events is
0 at the top of the loop and skip it forthwith if so.

You mean in WaitEventSetWait()? There's
else if (rc == 0)
{
break;
}
which is the timeout case. There should never be any other case of
returning 0 elements?

That's all I've got for now.

Thanks for looking.

Greetings,

Andres Freund


#56Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#53)
Re: Performance degradation in commit ac1d794

Hi,

On 2016-03-17 09:01:36 -0400, Robert Haas wrote:

0001: Looking at this again, I'm no longer sure this is a bug.
Doesn't your patch just check the same conditions in the opposite
order?

Yes, that's what's required.

0004:

+         * drain it everytime WaitLatchOrSocket() is used. Should the
+         * pipe-buffer fill up in some scenarios - widly unlikely - we're

every time
wildly

Why is it wildly (or widly) unlikely?

The rejiggering this does between what is on which element of pfds[]
appears to be unrelated to the ostensible purpose of the patch.

Well, not really. We need to know when to do drainSelfPipe(), which gets
more complicated if pfds[0] is registered optionally.

I'm actually considering dropping this entirely, given the much heavier
rework in the WaitEventSet patch, which makes these details a bit obsolete.

Greetings,

Andres Freund


#57Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#53)
Re: Performance degradation in commit ac1d794

On 2016-03-17 09:01:36 -0400, Robert Haas wrote:

* Right now the caller has to allocate the WaitEvents he's waiting for
locally (likely on the stack), but we also could allocate them as part
of the WaitEventSet. Not sure if that'd be a benefit.

I'm not seeing this. What do you mean?

Right now, to use a WaitEventSet you'd do something like
WaitEvent event;

ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);

WaitEventSetWait(FeBeWaitSet, 0 /* no timeout */, &event, 1);

i.e. use a WaitEvent on the stack to receive the changes. If you wanted
to get more changes than just one, you could end up allocating a fair
bit of stack space.

We could instead allocate the returned events as part of the event set
and return them: either by returning a NULL-terminated array, or by
continuing to return the number of events as now and additionally
returning the event data structure via a pointer.

So the above would be

WaitEvent *events;
int nevents;

ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);

nevents = WaitEventSetWait(FeBeWaitSet, 0 /* no timeout */, events, 10);

for (int off = 0; off < nevents; off++)
; // stuff

Andres


#58Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#53)
5 attachment(s)
Re: Performance degradation in commit ac1d794

Hi,

On 2016-03-17 09:01:36 -0400, Robert Haas wrote:

0001: Looking at this again, I'm no longer sure this is a bug.
Doesn't your patch just check the same conditions in the opposite
order?

Which is important, because what ends up in which pfds[x] depends on
wakeEvents. Folded it into a later patch; it's not harmful as long as
we're only ever testing pfds[0].

0003: Mostly boring. But the change to win32_latch.c seems to remove
an unrelated check.

Argh.

0004:

+         * drain it everytime WaitLatchOrSocket() is used. Should the
+         * pipe-buffer fill up in some scenarios - widly unlikely - we're

every time
wildly

Why is it wildly (or widly) unlikely?

Because SetLatch() (if called by the owner) checks latch->is_set before
adding to the pipe, and latch_sigusr1_handler() only writes to the pipe if
the current process is in WaitLatchOrSocket's loop (via the waiting check).
Expanded the comment.

+ * Check again wether latch is set, the arrival of a signal/self-byte

whether. Also not clearly related to the patch's main purpose.

After the change there's no need to re-compute the current timestamp
anymore; that does seem beneficial and kinda related.

/* at least one event occurred, so check masks */
+            if (FD_ISSET(selfpipe_readfd, &input_mask))
+            {
+                /* There's data in the self-pipe, clear it. */
+                drainSelfPipe();
+            }

The comment just preceding this added hunk now seems to be out of
place, and maybe inaccurate as well.

Hm. Which comment are you exactly referring to?
/* at least one event occurred, so check masks */
seems not to fit the bill?

I think the new code could have
a bit more detailed comment. My understanding is something like /*
Since we didn't clear the self-pipe before attempting to wait,
select() may have returned immediately even though there has been no
recent change to the state of the latch. To prevent busy-looping, we
must clear the pipe before attempting to wait again. */

Isn't that explained at the top, in
/*
* Check if the latch is set already. If so, leave loop immediately,
* avoid blocking again. We don't attempt to report any other events
* that might also be satisfied.
*
* If someone sets the latch between this and the poll()/select()
* below, the setter will write a byte to the pipe (or signal us and
* the signal handler will do that), and the poll()/select() will
* return immediately.
*
* If there's a pending byte in the self pipe, we'll notice whenever
* blocking. Only clearing the pipe in that case avoids having to
* drain it every time WaitLatchOrSocket() is used. Should the
* pipe-buffer fill up in some scenarios - wildly unlikely - we're
* still ok, because the pipe is in nonblocking mode.
?

I've updated the last paragraph to
* If there's a pending byte in the self pipe, we'll notice whenever
* blocking. Only clearing the pipe in that case avoids having to
* drain it every time WaitLatchOrSocket() is used. Should the
* pipe-buffer fill up we're still ok, because the pipe is in
* nonblocking mode. It's unlikely for that to happen, because the
* self pipe isn't filled unless we're blocking (waiting = true), or
* from inside a signal handler in latch_sigusr1_handler().

I've also applied the same optimization to windows. Less because I found
that interesting in itself, and more because it makes the WaitEventSet
easier.

Attached is a significantly revised version of the earlier series. Most
importantly I have:
* Unified the window/unix latch implementation into one file (0004)
* Provided a select(2) implementation for the WaitEventSet API
* Provided a windows implementation for the WaitEventSet API
* Reduced duplication between the implementations a good bit by
splitting WaitEventSetWait into WaitEventSetWait and
WaitEventSetWaitBlock. Only the latter is implemented separately for
each readiness primitive
* Added a backward-compatibility implementation of WaitLatchOrSocket
using the WaitEventSet stuff. Less because I thought that to be
terribly important, and more because it makes the patch a *lot*
smaller. We've collected a fair amount of latch users; a rough sketch of
that shim follows below.
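
In spirit the shim is just this (simplified sketch, not the literal patch
code; it assumes the WaitEventSet API posted upthread, including its
convention that a timeout of 0 means "wait forever", and glosses over
corner cases such as WL_TIMEOUT with timeout == 0):

int
WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
                  long timeout)
{
    WaitEventSet *set = CreateWaitEventSet(3);
    WaitEvent event;
    int ret;

    if (wakeEvents & WL_LATCH_SET)
        AddWaitEventToSet(set, WL_LATCH_SET, -1, (Latch *) latch);
    if (wakeEvents & WL_POSTMASTER_DEATH)
        AddWaitEventToSet(set, WL_POSTMASTER_DEATH, -1, NULL);
    if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
        AddWaitEventToSet(set,
                          wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE),
                          sock, NULL);

    /* 0 here means "no timeout" per the WIP's convention */
    if (WaitEventSetWait(set, (wakeEvents & WL_TIMEOUT) ? timeout : 0,
                         &event, 1) == 0)
        ret = WL_TIMEOUT;
    else
        ret = event.events;

    /* a real version would also release the set (and any epoll fd) here */
    pfree(set);
    return ret;
}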

This is still not fully ready. The main remaining items are testing (the
windows stuff I've only verified by cross-compiling with mingw) and
documentation.

I'd greatly appreciate a look.

Amit, you offered testing on windows; could you check whether 3/4/5
work? It's quite likely that I've screwed up something.

Robert, you'd mentioned on IM that you have a use-case for this somewhere
around multiple FDWs. If somebody has started working on that, could you
ask that person to check whether the API makes sense?

Greetings,

Andres Freund

Attachments:

0001-Make-it-easier-to-choose-the-used-waiting-primitive-.patch
From 916b95e211aa017643088ba7cbb239545ac8d944 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 18 Mar 2016 00:52:07 -0700
Subject: [PATCH 1/5] Make it easier to choose the used waiting primitive in
 unix_latch.c.

This allows for easier testing of the different primitives; in
preparation for adding a new primitive.

Discussion: 20160114143931.GG10941@awork2.anarazel.de
Reviewed-By: Robert Haas
---
 src/backend/port/unix_latch.c | 50 +++++++++++++++++++++++++++++--------------
 1 file changed, 34 insertions(+), 16 deletions(-)

diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
index 2ad609c..f52704b 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -56,6 +56,22 @@
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
 
+/*
+ * Select the fd readiness primitive to use. Normally the "most modern"
+ * primitive supported by the OS will be used, but for testing it can be
+ * useful to manually specify the used primitive.  If desired, just add a
+ * define somewhere before this block.
+ */
+#if defined(LATCH_USE_POLL) || defined(LATCH_USE_SELECT)
+/* don't overwrite manual choice */
+#elif defined(HAVE_POLL)
+#define LATCH_USE_POLL
+#elif HAVE_SYS_SELECT_H
+#define LATCH_USE_SELECT
+#else
+#error "no latch implementation available"
+#endif
+
 /* Are we currently in WaitLatch? The signal handler would like to know. */
 static volatile sig_atomic_t waiting = false;
 
@@ -215,10 +231,10 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 				cur_time;
 	long		cur_timeout;
 
-#ifdef HAVE_POLL
+#if defined(LATCH_USE_POLL)
 	struct pollfd pfds[3];
 	int			nfds;
-#else
+#elif defined(LATCH_USE_SELECT)
 	struct timeval tv,
 			   *tvp;
 	fd_set		input_mask;
@@ -247,7 +263,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 		Assert(timeout >= 0 && timeout <= INT_MAX);
 		cur_timeout = timeout;
 
-#ifndef HAVE_POLL
+#ifdef LATCH_USE_SELECT
 		tv.tv_sec = cur_timeout / 1000L;
 		tv.tv_usec = (cur_timeout % 1000L) * 1000L;
 		tvp = &tv;
@@ -257,7 +273,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	{
 		cur_timeout = -1;
 
-#ifndef HAVE_POLL
+#ifdef LATCH_USE_SELECT
 		tvp = NULL;
 #endif
 	}
@@ -291,16 +307,10 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 		}
 
 		/*
-		 * Must wait ... we use poll(2) if available, otherwise select(2).
-		 *
-		 * On at least older linux kernels select(), in violation of POSIX,
-		 * doesn't reliably return a socket as writable if closed - but we
-		 * rely on that. So far all the known cases of this problem are on
-		 * platforms that also provide a poll() implementation without that
-		 * bug.  If we find one where that's not the case, we'll need to add a
-		 * workaround.
+		 * Must wait ... we use the polling interface determined at the top of
+		 * this file to do so.
 		 */
-#ifdef HAVE_POLL
+#if defined(LATCH_USE_POLL)
 		nfds = 0;
 		if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
 		{
@@ -396,8 +406,16 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 					result |= WL_POSTMASTER_DEATH;
 			}
 		}
-#else							/* !HAVE_POLL */
+#elif defined(LATCH_USE_SELECT)
 
+		/*
+		 * On at least older linux kernels select(), in violation of POSIX,
+		 * doesn't reliably return a socket as writable if closed - but we
+		 * rely on that. So far all the known cases of this problem are on
+		 * platforms that also provide a poll() implementation without that
+		 * bug.  If we find one where that's not the case, we'll need to add a
+		 * workaround.
+		 */
 		FD_ZERO(&input_mask);
 		FD_ZERO(&output_mask);
 
@@ -477,7 +495,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 					result |= WL_POSTMASTER_DEATH;
 			}
 		}
-#endif   /* HAVE_POLL */
+#endif   /* LATCH_USE_SELECT */
 
 		/* If we're not done, update cur_timeout for next iteration */
 		if (result == 0 && (wakeEvents & WL_TIMEOUT))
@@ -490,7 +508,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 				/* Timeout has expired, no need to continue looping */
 				result |= WL_TIMEOUT;
 			}
-#ifndef HAVE_POLL
+#ifdef LATCH_USE_SELECT
 			else
 			{
 				tv.tv_sec = cur_timeout / 1000L;
-- 
2.7.0.229.g701fa7f
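
As a usage note on 0001: forcing a specific primitive for testing really is
just a define above the selection block, e.g.

    /* place before the primitive-selection block in unix_latch.c */
    #define LATCH_USE_SELECT

or, assuming your build forwards extra compiler flags in the usual way,
something like make COPT='-DLATCH_USE_SELECT'.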

0002-Error-out-if-waiting-on-socket-readiness-without-a-s.patch (text/x-patch; charset=us-ascii)
From 2eeb7dd4f6401a4f2d45293cddd505018aa4431e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 18 Mar 2016 00:52:07 -0700
Subject: [PATCH 2/5] Error out if waiting on socket readiness without a
 specified socket.

Previously we just ignored such an attempt, but that seems to serve no
purpose but making things harder to debug.

Discussion: 20160114143931.GG10941@awork2.anarazel.de
    20151230173734.hx7jj2fnwyljfqek@alap3.anarazel.de
Reviewed-By: Robert Haas
---
 src/backend/port/unix_latch.c  | 9 +++++----
 src/backend/port/win32_latch.c | 9 +++++----
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
index f52704b..e7be7ec 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -242,12 +242,13 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			hifd;
 #endif
 
-	/* Ignore WL_SOCKET_* events if no valid socket is given */
-	if (sock == PGINVALID_SOCKET)
-		wakeEvents &= ~(WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);
-
 	Assert(wakeEvents != 0);	/* must have at least one wake event */
 
+	/* waiting for socket readiness without a socket indicates a bug */
+	if (sock == PGINVALID_SOCKET &&
+		(wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+		elog(ERROR, "cannot wait on socket event without a socket");
+
 	if ((wakeEvents & WL_LATCH_SET) && latch->owner_pid != MyProcPid)
 		elog(ERROR, "cannot wait on a latch owned by another process");
 
diff --git a/src/backend/port/win32_latch.c b/src/backend/port/win32_latch.c
index 80adc13..b1b0713 100644
--- a/src/backend/port/win32_latch.c
+++ b/src/backend/port/win32_latch.c
@@ -113,12 +113,13 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			result = 0;
 	int			pmdeath_eventno = 0;
 
-	/* Ignore WL_SOCKET_* events if no valid socket is given */
-	if (sock == PGINVALID_SOCKET)
-		wakeEvents &= ~(WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);
-
 	Assert(wakeEvents != 0);	/* must have at least one wake event */
 
+	/* waiting for socket readiness without a socket indicates a bug */
+	if (sock == PGINVALID_SOCKET &&
+		(wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+		elog(ERROR, "cannot wait on socket event without a socket");
+
 	if ((wakeEvents & WL_LATCH_SET) && latch->owner_pid != MyProcPid)
 		elog(ERROR, "cannot wait on a latch owned by another process");
 
-- 
2.7.0.229.g701fa7f
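
To spell out what 0002 means for callers: a call like the following, which
previously just dropped the socket bits and waited on the latch alone, now
fails hard (sketch):

    /*
     * Before 0002 WL_SOCKET_READABLE was silently ignored here; with the
     * patch this raises "cannot wait on socket event without a socket".
     */
    (void) WaitLatchOrSocket(MyLatch,
                             WL_LATCH_SET | WL_SOCKET_READABLE | WL_TIMEOUT,
                             PGINVALID_SOCKET,
                             1000L);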

0003-Only-clear-latch-self-pipe-event-if-there-is-a-pendi.patch (text/x-patch; charset=us-ascii)
From d76ac6f857c4c273a54b3f9b914363587667f435 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 18 Mar 2016 00:52:07 -0700
Subject: [PATCH 3/5] Only clear latch self-pipe/event if there is a pending
 notification.

This avoids a good number of, individually quite fast, system calls in
scenarios with many quick queries. Besides the aesthetic benefit of
seeing fewer superfluous system calls with strace, it also improves
performance by ~2% measured by pgbench -M prepared -c 96 -j 8 -S (scale
100).

Without having benchmarked it, this patch also adjusts the Windows code,
as that makes it easier to unify the unix/windows codepaths in a later
patch. There's little reason to diverge in behaviour between the
platforms.

Discussion: CA+TgmoYc1Zm+Szoc_Qbzi92z2c1vRHZmjhfPn5uC=w8bXv6Avg@mail.gmail.com
Reviewed-By: Robert Haas
---
 src/backend/port/unix_latch.c  | 81 ++++++++++++++++++++++++++++--------------
 src/backend/port/win32_latch.c | 19 +++++-----
 2 files changed, 65 insertions(+), 35 deletions(-)

diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
index e7be7ec..104401d 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -283,27 +283,31 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	do
 	{
 		/*
-		 * Clear the pipe, then check if the latch is set already. If someone
-		 * sets the latch between this and the poll()/select() below, the
-		 * setter will write a byte to the pipe (or signal us and the signal
-		 * handler will do that), and the poll()/select() will return
-		 * immediately.
+		 * Check if the latch is set already. If so, leave loop immediately,
+		 * avoid blocking again. We don't attempt to report any other events
+		 * that might also be satisfied.
+		 *
+		 * If someone sets the latch between this and the poll()/select()
+		 * below, the setter will write a byte to the pipe (or signal us and
+		 * the signal handler will do that), and the poll()/select() will
+		 * return immediately.
+		 *
+		 * If there's a pending byte in the self pipe, we'll notice whenever
+		 * blocking. Only clearing the pipe in that case avoids having to
+		 * drain it every time WaitLatchOrSocket() is used. Should the
+		 * pipe-buffer fill up we're still ok, because the pipe is in
+		 * nonblocking mode. It's unlikely for that to happen, because the
+		 * self pipe isn't filled unless we're blocking (waiting = true), or
+		 * from inside a signal handler in latch_sigusr1_handler().
 		 *
 		 * Note: we assume that the kernel calls involved in drainSelfPipe()
 		 * and SetLatch() will provide adequate synchronization on machines
 		 * with weak memory ordering, so that we cannot miss seeing is_set if
 		 * the signal byte is already in the pipe when we drain it.
 		 */
-		drainSelfPipe();
-
 		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
 		{
 			result |= WL_LATCH_SET;
-
-			/*
-			 * Leave loop immediately, avoid blocking again. We don't attempt
-			 * to report any other events that might also be satisfied.
-			 */
 			break;
 		}
 
@@ -313,24 +317,26 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 		 */
 #if defined(LATCH_USE_POLL)
 		nfds = 0;
+
+		/* selfpipe is always in pfds[0] */
+		pfds[0].fd = selfpipe_readfd;
+		pfds[0].events = POLLIN;
+		pfds[0].revents = 0;
+		nfds++;
+
 		if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
 		{
-			/* socket, if used, is always in pfds[0] */
-			pfds[0].fd = sock;
-			pfds[0].events = 0;
+			/* socket, if used, is always in pfds[1] */
+			pfds[1].fd = sock;
+			pfds[1].events = 0;
 			if (wakeEvents & WL_SOCKET_READABLE)
-				pfds[0].events |= POLLIN;
+				pfds[1].events |= POLLIN;
 			if (wakeEvents & WL_SOCKET_WRITEABLE)
-				pfds[0].events |= POLLOUT;
-			pfds[0].revents = 0;
+				pfds[1].events |= POLLOUT;
+			pfds[1].revents = 0;
 			nfds++;
 		}
 
-		pfds[nfds].fd = selfpipe_readfd;
-		pfds[nfds].events = POLLIN;
-		pfds[nfds].revents = 0;
-		nfds++;
-
 		if (wakeEvents & WL_POSTMASTER_DEATH)
 		{
 			/* postmaster fd, if used, is always in pfds[nfds - 1] */
@@ -364,19 +370,27 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 		else
 		{
 			/* at least one event occurred, so check revents values */
+
+			if (pfds[0].revents & POLLIN)
+			{
+				/* There's data in the self-pipe, clear it. */
+				drainSelfPipe();
+			}
+
 			if ((wakeEvents & WL_SOCKET_READABLE) &&
-				(pfds[0].revents & POLLIN))
+				(pfds[1].revents & POLLIN))
 			{
 				/* data available in socket, or EOF/error condition */
 				result |= WL_SOCKET_READABLE;
 			}
 			if ((wakeEvents & WL_SOCKET_WRITEABLE) &&
-				(pfds[0].revents & POLLOUT))
+				(pfds[1].revents & POLLOUT))
 			{
 				/* socket is writable */
 				result |= WL_SOCKET_WRITEABLE;
 			}
-			if (pfds[0].revents & (POLLHUP | POLLERR | POLLNVAL))
+			if ((wakeEvents & WL_SOCKET_WRITEABLE) &&
+				(pfds[1].revents & (POLLHUP | POLLERR | POLLNVAL)))
 			{
 				/* EOF/error condition */
 				if (wakeEvents & WL_SOCKET_READABLE)
@@ -468,6 +482,11 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 		else
 		{
 			/* at least one event occurred, so check masks */
+			if (FD_ISSET(selfpipe_readfd, &input_mask))
+			{
+				/* There's data in the self-pipe, clear it. */
+				drainSelfPipe();
+			}
 			if ((wakeEvents & WL_SOCKET_READABLE) && FD_ISSET(sock, &input_mask))
 			{
 				/* data available in socket, or EOF */
@@ -498,6 +517,16 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 		}
 #endif   /* LATCH_USE_SELECT */
 
+		/*
+		 * Check again whether latch is set, the arrival of a signal/self-byte
+		 * might be what stopped our sleep. It's not required for correctness
+		 * to signal the latch as being set (we'd just loop if there's no
+		 * other event), but it seems good to report an arrived latch asap.
+		 * This way we also don't have to compute the current timestamp again.
+		 */
+		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
+			result |= WL_LATCH_SET;
+
 		/* If we're not done, update cur_timeout for next iteration */
 		if (result == 0 && (wakeEvents & WL_TIMEOUT))
 		{
diff --git a/src/backend/port/win32_latch.c b/src/backend/port/win32_latch.c
index b1b0713..bbf1b24 100644
--- a/src/backend/port/win32_latch.c
+++ b/src/backend/port/win32_latch.c
@@ -181,14 +181,11 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	do
 	{
 		/*
-		 * Reset the event, and check if the latch is set already. If someone
-		 * sets the latch between this and the WaitForMultipleObjects() call
-		 * below, the setter will set the event and WaitForMultipleObjects()
-		 * will return immediately.
+		 * The comment in unix_latch.c's equivalent to this applies here as
+		 * well. At least after mentally replacing self-pipe with windows
+		 * event. There's no danger of overflowing, as "Setting an event that
+		 * is already set has no effect.".
 		 */
-		if (!ResetEvent(latchevent))
-			elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());
-
 		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
 		{
 			result |= WL_LATCH_SET;
@@ -217,9 +214,13 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 		else if (rc == WAIT_OBJECT_0 + 1)
 		{
 			/*
-			 * Latch is set.  We'll handle that on next iteration of loop, but
-			 * let's not waste the cycles to update cur_timeout below.
+			 * Reset the event.  We'll re-check the, potentially, set latch on
+			 * next iteration of loop, but let's not waste the cycles to
+			 * update cur_timeout below.
 			 */
+			if (!ResetEvent(latchevent))
+				elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());
+
 			continue;
 		}
 		else if ((wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) &&
-- 
2.7.0.229.g701fa7f
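
Condensed, the poll() branch of the wait loop after 0003 looks roughly like
this (sketch; the socket and postmaster fds, the select() path, and error
handling are omitted):

    for (;;)
    {
        /* latch already set?  done, without touching the self-pipe */
        if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
            break;

        /* the self-pipe now always sits in pfds[0] */
        pfds[0].fd = selfpipe_readfd;
        pfds[0].events = POLLIN;
        pfds[0].revents = 0;

        rc = poll(pfds, nfds, (int) cur_timeout);

        /* drain only if the pipe actually became readable */
        if (rc > 0 && (pfds[0].revents & POLLIN))
            drainSelfPipe();

        /* re-check the latch; a self-byte/signal may be what woke us */
        if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
            break;
    }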

0004-Combine-win32-and-unix-latch-implementations.patch (text/x-patch; charset=us-ascii)
From 1d444b0855dbf65d66d73beb647b772fff3404c8 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 18 Mar 2016 00:52:07 -0700
Subject: [PATCH 4/5] Combine win32 and unix latch implementations.

Previously latches for windows and unix had been implemented in
different files. The next patch in this series will introduce an
expanded wait infrastructure; keeping the implementations separate would
introduce too much duplication.

This basically just moves the functions, without too much change. The
reason to keep this as a separate commit is that it allows blame to
continue working a little less badly, and makes review a tiny bit easier.
---
 configure                                          |  10 +-
 configure.in                                       |   8 -
 src/backend/Makefile                               |   3 +-
 src/backend/port/.gitignore                        |   1 -
 src/backend/port/Makefile                          |   2 +-
 src/backend/port/win32_latch.c                     | 349 ---------------------
 src/backend/storage/ipc/Makefile                   |   5 +-
 .../{port/unix_latch.c => storage/ipc/latch.c}     | 280 ++++++++++++++++-
 src/include/storage/latch.h                        |   2 +-
 src/tools/msvc/Mkvcbuild.pm                        |   2 -
 10 files changed, 277 insertions(+), 385 deletions(-)
 delete mode 100644 src/backend/port/win32_latch.c
 rename src/backend/{port/unix_latch.c => storage/ipc/latch.c} (74%)

diff --git a/configure b/configure
index a45be67..c10d954 100755
--- a/configure
+++ b/configure
@@ -14786,13 +14786,6 @@ $as_echo "#define USE_WIN32_SHARED_MEMORY 1" >>confdefs.h
   SHMEM_IMPLEMENTATION="src/backend/port/win32_shmem.c"
 fi
 
-# Select latch implementation type.
-if test "$PORTNAME" != "win32"; then
-  LATCH_IMPLEMENTATION="src/backend/port/unix_latch.c"
-else
-  LATCH_IMPLEMENTATION="src/backend/port/win32_latch.c"
-fi
-
 # If not set in template file, set bytes to use libc memset()
 if test x"$MEMSET_LOOP_LIMIT" = x"" ; then
   MEMSET_LOOP_LIMIT=1024
@@ -15868,7 +15861,7 @@ fi
 ac_config_files="$ac_config_files GNUmakefile src/Makefile.global"
 
 
-ac_config_links="$ac_config_links src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION} src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION} src/backend/port/pg_latch.c:${LATCH_IMPLEMENTATION} src/include/dynloader.h:src/backend/port/dynloader/${template}.h src/include/pg_config_os.h:src/include/port/${template}.h src/Makefile.port:src/makefiles/Makefile.${template}"
+ac_config_links="$ac_config_links src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION} src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION} src/include/dynloader.h:src/backend/port/dynloader/${template}.h src/include/pg_config_os.h:src/include/port/${template}.h src/Makefile.port:src/makefiles/Makefile.${template}"
 
 
 if test "$PORTNAME" = "win32"; then
@@ -16592,7 +16585,6 @@ do
     "src/backend/port/dynloader.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c" ;;
     "src/backend/port/pg_sema.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION}" ;;
     "src/backend/port/pg_shmem.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION}" ;;
-    "src/backend/port/pg_latch.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/pg_latch.c:${LATCH_IMPLEMENTATION}" ;;
     "src/include/dynloader.h") CONFIG_LINKS="$CONFIG_LINKS src/include/dynloader.h:src/backend/port/dynloader/${template}.h" ;;
     "src/include/pg_config_os.h") CONFIG_LINKS="$CONFIG_LINKS src/include/pg_config_os.h:src/include/port/${template}.h" ;;
     "src/Makefile.port") CONFIG_LINKS="$CONFIG_LINKS src/Makefile.port:src/makefiles/Makefile.${template}" ;;
diff --git a/configure.in b/configure.in
index c298926..47d0f58 100644
--- a/configure.in
+++ b/configure.in
@@ -1976,13 +1976,6 @@ else
   SHMEM_IMPLEMENTATION="src/backend/port/win32_shmem.c"
 fi
 
-# Select latch implementation type.
-if test "$PORTNAME" != "win32"; then
-  LATCH_IMPLEMENTATION="src/backend/port/unix_latch.c"
-else
-  LATCH_IMPLEMENTATION="src/backend/port/win32_latch.c"
-fi
-
 # If not set in template file, set bytes to use libc memset()
 if test x"$MEMSET_LOOP_LIMIT" = x"" ; then
   MEMSET_LOOP_LIMIT=1024
@@ -2178,7 +2171,6 @@ AC_CONFIG_LINKS([
   src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c
   src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION}
   src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION}
-  src/backend/port/pg_latch.c:${LATCH_IMPLEMENTATION}
   src/include/dynloader.h:src/backend/port/dynloader/${template}.h
   src/include/pg_config_os.h:src/include/port/${template}.h
   src/Makefile.port:src/makefiles/Makefile.${template}
diff --git a/src/backend/Makefile b/src/backend/Makefile
index b3d5e2e..d22dbbf 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -306,8 +306,7 @@ ifeq ($(PORTNAME), win32)
 endif
 
 distclean: clean
-	rm -f port/tas.s port/dynloader.c port/pg_sema.c port/pg_shmem.c \
-	      port/pg_latch.c
+	rm -f port/tas.s port/dynloader.c port/pg_sema.c port/pg_shmem.c
 
 maintainer-clean: distclean
 	rm -f bootstrap/bootparse.c \
diff --git a/src/backend/port/.gitignore b/src/backend/port/.gitignore
index 7d3ac4a..9f4f1af 100644
--- a/src/backend/port/.gitignore
+++ b/src/backend/port/.gitignore
@@ -1,5 +1,4 @@
 /dynloader.c
-/pg_latch.c
 /pg_sema.c
 /pg_shmem.c
 /tas.s
diff --git a/src/backend/port/Makefile b/src/backend/port/Makefile
index c6b1d20..89549d0 100644
--- a/src/backend/port/Makefile
+++ b/src/backend/port/Makefile
@@ -21,7 +21,7 @@ subdir = src/backend/port
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = atomics.o dynloader.o pg_sema.o pg_shmem.o pg_latch.o $(TAS)
+OBJS = atomics.o dynloader.o pg_sema.o pg_shmem.o $(TAS)
 
 ifeq ($(PORTNAME), darwin)
 SUBDIRS += darwin
diff --git a/src/backend/port/win32_latch.c b/src/backend/port/win32_latch.c
deleted file mode 100644
index bbf1b24..0000000
--- a/src/backend/port/win32_latch.c
+++ /dev/null
@@ -1,349 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * win32_latch.c
- *	  Routines for inter-process latches
- *
- * See unix_latch.c for header comments for the exported functions;
- * the API presented here is supposed to be the same as there.
- *
- * The Windows implementation uses Windows events that are inherited by
- * all postmaster child processes.
- *
- * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
- * Portions Copyright (c) 1994, Regents of the University of California
- *
- * IDENTIFICATION
- *	  src/backend/port/win32_latch.c
- *
- *-------------------------------------------------------------------------
- */
-#include "postgres.h"
-
-#include <fcntl.h>
-#include <limits.h>
-#include <signal.h>
-#include <unistd.h>
-
-#include "miscadmin.h"
-#include "portability/instr_time.h"
-#include "postmaster/postmaster.h"
-#include "storage/barrier.h"
-#include "storage/latch.h"
-#include "storage/pmsignal.h"
-#include "storage/shmem.h"
-
-
-void
-InitializeLatchSupport(void)
-{
-	/* currently, nothing to do here for Windows */
-}
-
-void
-InitLatch(volatile Latch *latch)
-{
-	latch->is_set = false;
-	latch->owner_pid = MyProcPid;
-	latch->is_shared = false;
-
-	latch->event = CreateEvent(NULL, TRUE, FALSE, NULL);
-	if (latch->event == NULL)
-		elog(ERROR, "CreateEvent failed: error code %lu", GetLastError());
-}
-
-void
-InitSharedLatch(volatile Latch *latch)
-{
-	SECURITY_ATTRIBUTES sa;
-
-	latch->is_set = false;
-	latch->owner_pid = 0;
-	latch->is_shared = true;
-
-	/*
-	 * Set up security attributes to specify that the events are inherited.
-	 */
-	ZeroMemory(&sa, sizeof(sa));
-	sa.nLength = sizeof(sa);
-	sa.bInheritHandle = TRUE;
-
-	latch->event = CreateEvent(&sa, TRUE, FALSE, NULL);
-	if (latch->event == NULL)
-		elog(ERROR, "CreateEvent failed: error code %lu", GetLastError());
-}
-
-void
-OwnLatch(volatile Latch *latch)
-{
-	/* Sanity checks */
-	Assert(latch->is_shared);
-	if (latch->owner_pid != 0)
-		elog(ERROR, "latch already owned");
-
-	latch->owner_pid = MyProcPid;
-}
-
-void
-DisownLatch(volatile Latch *latch)
-{
-	Assert(latch->is_shared);
-	Assert(latch->owner_pid == MyProcPid);
-
-	latch->owner_pid = 0;
-}
-
-int
-WaitLatch(volatile Latch *latch, int wakeEvents, long timeout)
-{
-	return WaitLatchOrSocket(latch, wakeEvents, PGINVALID_SOCKET, timeout);
-}
-
-int
-WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
-				  long timeout)
-{
-	DWORD		rc;
-	instr_time	start_time,
-				cur_time;
-	long		cur_timeout;
-	HANDLE		events[4];
-	HANDLE		latchevent;
-	HANDLE		sockevent = WSA_INVALID_EVENT;
-	int			numevents;
-	int			result = 0;
-	int			pmdeath_eventno = 0;
-
-	Assert(wakeEvents != 0);	/* must have at least one wake event */
-
-	/* waiting for socket readiness without a socket indicates a bug */
-	if (sock == PGINVALID_SOCKET &&
-		(wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
-		elog(ERROR, "cannot wait on socket event without a socket");
-
-	if ((wakeEvents & WL_LATCH_SET) && latch->owner_pid != MyProcPid)
-		elog(ERROR, "cannot wait on a latch owned by another process");
-
-	/*
-	 * Initialize timeout if requested.  We must record the current time so
-	 * that we can determine the remaining timeout if WaitForMultipleObjects
-	 * is interrupted.
-	 */
-	if (wakeEvents & WL_TIMEOUT)
-	{
-		INSTR_TIME_SET_CURRENT(start_time);
-		Assert(timeout >= 0 && timeout <= INT_MAX);
-		cur_timeout = timeout;
-	}
-	else
-		cur_timeout = INFINITE;
-
-	/*
-	 * Construct an array of event handles for WaitforMultipleObjects().
-	 *
-	 * Note: pgwin32_signal_event should be first to ensure that it will be
-	 * reported when multiple events are set.  We want to guarantee that
-	 * pending signals are serviced.
-	 */
-	latchevent = latch->event;
-
-	events[0] = pgwin32_signal_event;
-	events[1] = latchevent;
-	numevents = 2;
-	if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
-	{
-		/* Need an event object to represent events on the socket */
-		int			flags = FD_CLOSE;	/* always check for errors/EOF */
-
-		if (wakeEvents & WL_SOCKET_READABLE)
-			flags |= FD_READ;
-		if (wakeEvents & WL_SOCKET_WRITEABLE)
-			flags |= FD_WRITE;
-
-		sockevent = WSACreateEvent();
-		if (sockevent == WSA_INVALID_EVENT)
-			elog(ERROR, "failed to create event for socket: error code %u",
-				 WSAGetLastError());
-		if (WSAEventSelect(sock, sockevent, flags) != 0)
-			elog(ERROR, "failed to set up event for socket: error code %u",
-				 WSAGetLastError());
-
-		events[numevents++] = sockevent;
-	}
-	if (wakeEvents & WL_POSTMASTER_DEATH)
-	{
-		pmdeath_eventno = numevents;
-		events[numevents++] = PostmasterHandle;
-	}
-
-	/* Ensure that signals are serviced even if latch is already set */
-	pgwin32_dispatch_queued_signals();
-
-	do
-	{
-		/*
-		 * The comment in unix_latch.c's equivalent to this applies here as
-		 * well. At least after mentally replacing self-pipe with windows
-		 * event. There's no danger of overflowing, as "Setting an event that
-		 * is already set has no effect.".
-		 */
-		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
-		{
-			result |= WL_LATCH_SET;
-
-			/*
-			 * Leave loop immediately, avoid blocking again. We don't attempt
-			 * to report any other events that might also be satisfied.
-			 */
-			break;
-		}
-
-		rc = WaitForMultipleObjects(numevents, events, FALSE, cur_timeout);
-
-		if (rc == WAIT_FAILED)
-			elog(ERROR, "WaitForMultipleObjects() failed: error code %lu",
-				 GetLastError());
-		else if (rc == WAIT_TIMEOUT)
-		{
-			result |= WL_TIMEOUT;
-		}
-		else if (rc == WAIT_OBJECT_0)
-		{
-			/* Service newly-arrived signals */
-			pgwin32_dispatch_queued_signals();
-		}
-		else if (rc == WAIT_OBJECT_0 + 1)
-		{
-			/*
-			 * Reset the event.  We'll re-check the, potentially, set latch on
-			 * next iteration of loop, but let's not waste the cycles to
-			 * update cur_timeout below.
-			 */
-			if (!ResetEvent(latchevent))
-				elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());
-
-			continue;
-		}
-		else if ((wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) &&
-				 rc == WAIT_OBJECT_0 + 2)		/* socket is at event slot 2 */
-		{
-			WSANETWORKEVENTS resEvents;
-
-			ZeroMemory(&resEvents, sizeof(resEvents));
-			if (WSAEnumNetworkEvents(sock, sockevent, &resEvents) != 0)
-				elog(ERROR, "failed to enumerate network events: error code %u",
-					 WSAGetLastError());
-			if ((wakeEvents & WL_SOCKET_READABLE) &&
-				(resEvents.lNetworkEvents & FD_READ))
-			{
-				result |= WL_SOCKET_READABLE;
-			}
-			if ((wakeEvents & WL_SOCKET_WRITEABLE) &&
-				(resEvents.lNetworkEvents & FD_WRITE))
-			{
-				result |= WL_SOCKET_WRITEABLE;
-			}
-			if (resEvents.lNetworkEvents & FD_CLOSE)
-			{
-				if (wakeEvents & WL_SOCKET_READABLE)
-					result |= WL_SOCKET_READABLE;
-				if (wakeEvents & WL_SOCKET_WRITEABLE)
-					result |= WL_SOCKET_WRITEABLE;
-			}
-		}
-		else if ((wakeEvents & WL_POSTMASTER_DEATH) &&
-				 rc == WAIT_OBJECT_0 + pmdeath_eventno)
-		{
-			/*
-			 * Postmaster apparently died.  Since the consequences of falsely
-			 * returning WL_POSTMASTER_DEATH could be pretty unpleasant, we
-			 * take the trouble to positively verify this with
-			 * PostmasterIsAlive(), even though there is no known reason to
-			 * think that the event could be falsely set on Windows.
-			 */
-			if (!PostmasterIsAlive())
-				result |= WL_POSTMASTER_DEATH;
-		}
-		else
-			elog(ERROR, "unexpected return code from WaitForMultipleObjects(): %lu", rc);
-
-		/* If we're not done, update cur_timeout for next iteration */
-		if (result == 0 && (wakeEvents & WL_TIMEOUT))
-		{
-			INSTR_TIME_SET_CURRENT(cur_time);
-			INSTR_TIME_SUBTRACT(cur_time, start_time);
-			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
-			if (cur_timeout <= 0)
-			{
-				/* Timeout has expired, no need to continue looping */
-				result |= WL_TIMEOUT;
-			}
-		}
-	} while (result == 0);
-
-	/* Clean up the event object we created for the socket */
-	if (sockevent != WSA_INVALID_EVENT)
-	{
-		WSAEventSelect(sock, NULL, 0);
-		WSACloseEvent(sockevent);
-	}
-
-	return result;
-}
-
-/*
- * The comments above the unix implementation (unix_latch.c) of this function
- * apply here as well.
- */
-void
-SetLatch(volatile Latch *latch)
-{
-	HANDLE		handle;
-
-	/*
-	 * The memory barrier has be to be placed here to ensure that any flag
-	 * variables possibly changed by this process have been flushed to main
-	 * memory, before we check/set is_set.
-	 */
-	pg_memory_barrier();
-
-	/* Quick exit if already set */
-	if (latch->is_set)
-		return;
-
-	latch->is_set = true;
-
-	/*
-	 * See if anyone's waiting for the latch. It can be the current process if
-	 * we're in a signal handler.
-	 *
-	 * Use a local variable here just in case somebody changes the event field
-	 * concurrently (which really should not happen).
-	 */
-	handle = latch->event;
-	if (handle)
-	{
-		SetEvent(handle);
-
-		/*
-		 * Note that we silently ignore any errors. We might be in a signal
-		 * handler or other critical path where it's not safe to call elog().
-		 */
-	}
-}
-
-void
-ResetLatch(volatile Latch *latch)
-{
-	/* Only the owner should reset the latch */
-	Assert(latch->owner_pid == MyProcPid);
-
-	latch->is_set = false;
-
-	/*
-	 * Ensure that the write to is_set gets flushed to main memory before we
-	 * examine any flag variables.  Otherwise a concurrent SetLatch might
-	 * falsely conclude that it needn't signal us, even though we have missed
-	 * seeing some flag updates that SetLatch was supposed to inform us of.
-	 */
-	pg_memory_barrier();
-}
diff --git a/src/backend/storage/ipc/Makefile b/src/backend/storage/ipc/Makefile
index d8eb742..8a55392 100644
--- a/src/backend/storage/ipc/Makefile
+++ b/src/backend/storage/ipc/Makefile
@@ -8,7 +8,8 @@ subdir = src/backend/storage/ipc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = dsm_impl.o dsm.o ipc.o ipci.o pmsignal.o procarray.o procsignal.o \
-	shmem.o shmqueue.o shm_mq.o shm_toc.o sinval.o sinvaladt.o standby.o
+OBJS = dsm_impl.o dsm.o ipc.o ipci.o latch.o pmsignal.o procarray.o \
+	procsignal.o  shmem.o shmqueue.o shm_mq.o shm_toc.o sinval.o \
+	sinvaladt.o standby.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/port/unix_latch.c b/src/backend/storage/ipc/latch.c
similarity index 74%
rename from src/backend/port/unix_latch.c
rename to src/backend/storage/ipc/latch.c
index 104401d..143d2a1 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1,6 +1,6 @@
 /*-------------------------------------------------------------------------
  *
- * unix_latch.c
+ * latch.c
  *	  Routines for inter-process latches
  *
  * The Unix implementation uses the so-called self-pipe trick to overcome
@@ -22,11 +22,14 @@
  * process, SIGUSR1 is sent and the signal handler in the waiting process
  * writes the byte to the pipe on behalf of the signaling process.
  *
+ * The Windows implementation uses Windows events that are inherited by
+ * all postmaster child processes.
+ *
  * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  * IDENTIFICATION
- *	  src/backend/port/unix_latch.c
+ *	  src/backend/storage/ipc/latch.c
  *
  *-------------------------------------------------------------------------
  */
@@ -62,16 +65,19 @@
  * useful to manually specify the used primitive.  If desired, just add a
  * define somewhere before this block.
  */
-#if defined(LATCH_USE_POLL) || defined(LATCH_USE_SELECT)
+#if defined(LATCH_USE_POLL) || defined(LATCH_USE_SELECT) || defined(LATCH_USE_WIN32)
 /* don't overwrite manual choice */
 #elif defined(HAVE_POLL)
 #define LATCH_USE_POLL
 #elif HAVE_SYS_SELECT_H
 #define LATCH_USE_SELECT
+#elif WIN32
+#define LATCH_USE_WIN32
 #else
 #error "no latch implementation available"
 #endif
 
+#ifndef WIN32
 /* Are we currently in WaitLatch? The signal handler would like to know. */
 static volatile sig_atomic_t waiting = false;
 
@@ -82,6 +88,7 @@ static int	selfpipe_writefd = -1;
 /* Private function prototypes */
 static void sendSelfPipeByte(void);
 static void drainSelfPipe(void);
+#endif   /* WIN32 */
 
 
 /*
@@ -93,6 +100,7 @@ static void drainSelfPipe(void);
 void
 InitializeLatchSupport(void)
 {
+#ifndef WIN32
 	int			pipefd[2];
 
 	Assert(selfpipe_readfd == -1);
@@ -113,6 +121,9 @@ InitializeLatchSupport(void)
 
 	selfpipe_readfd = pipefd[0];
 	selfpipe_writefd = pipefd[1];
+#else
+	/* currently, nothing to do here for Windows */
+#endif
 }
 
 /*
@@ -121,12 +132,18 @@ InitializeLatchSupport(void)
 void
 InitLatch(volatile Latch *latch)
 {
-	/* Assert InitializeLatchSupport has been called in this process */
-	Assert(selfpipe_readfd >= 0);
-
 	latch->is_set = false;
 	latch->owner_pid = MyProcPid;
 	latch->is_shared = false;
+
+#ifndef WIN32
+	/* Assert InitializeLatchSupport has been called in this process */
+	Assert(selfpipe_readfd >= 0);
+#else
+	latch->event = CreateEvent(NULL, TRUE, FALSE, NULL);
+	if (latch->event == NULL)
+		elog(ERROR, "CreateEvent failed: error code %lu", GetLastError());
+#endif   /* WIN32 */
 }
 
 /*
@@ -143,6 +160,21 @@ InitLatch(volatile Latch *latch)
 void
 InitSharedLatch(volatile Latch *latch)
 {
+#ifdef WIN32
+	SECURITY_ATTRIBUTES sa;
+
+	/*
+	 * Set up security attributes to specify that the events are inherited.
+	 */
+	ZeroMemory(&sa, sizeof(sa));
+	sa.nLength = sizeof(sa);
+	sa.bInheritHandle = TRUE;
+
+	latch->event = CreateEvent(&sa, TRUE, FALSE, NULL);
+	if (latch->event == NULL)
+		elog(ERROR, "CreateEvent failed: error code %lu", GetLastError());
+#endif
+
 	latch->is_set = false;
 	latch->owner_pid = 0;
 	latch->is_shared = true;
@@ -164,12 +196,14 @@ InitSharedLatch(volatile Latch *latch)
 void
 OwnLatch(volatile Latch *latch)
 {
-	/* Assert InitializeLatchSupport has been called in this process */
-	Assert(selfpipe_readfd >= 0);
-
+	/* Sanity checks */
 	Assert(latch->is_shared);
 
-	/* sanity check */
+#ifndef WIN32
+	/* Assert InitializeLatchSupport has been called in this process */
+	Assert(selfpipe_readfd >= 0);
+#endif
+
 	if (latch->owner_pid != 0)
 		elog(ERROR, "latch already owned");
 
@@ -221,6 +255,7 @@ WaitLatch(volatile Latch *latch, int wakeEvents, long timeout)
  * returning the socket as readable/writable or both, depending on
  * WL_SOCKET_READABLE/WL_SOCKET_WRITEABLE being specified.
  */
+#ifndef WIN32
 int
 WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 				  long timeout)
@@ -551,6 +586,198 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 
 	return result;
 }
+#else
+int
+WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
+				  long timeout)
+{
+	DWORD		rc;
+	instr_time	start_time,
+				cur_time;
+	long		cur_timeout;
+	HANDLE		events[4];
+	HANDLE		latchevent;
+	HANDLE		sockevent = WSA_INVALID_EVENT;
+	int			numevents;
+	int			result = 0;
+	int			pmdeath_eventno = 0;
+
+	Assert(wakeEvents != 0);	/* must have at least one wake event */
+
+	/* waiting for socket readiness without a socket indicates a bug */
+	if (sock == PGINVALID_SOCKET &&
+		(wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+		elog(ERROR, "cannot wait on socket events without a socket");
+
+	if ((wakeEvents & WL_LATCH_SET) && latch->owner_pid != MyProcPid)
+		elog(ERROR, "cannot wait on a latch owned by another process");
+
+	/*
+	 * Initialize timeout if requested.  We must record the current time so
+	 * that we can determine the remaining timeout if WaitForMultipleObjects
+	 * is interrupted.
+	 */
+	if (wakeEvents & WL_TIMEOUT)
+	{
+		INSTR_TIME_SET_CURRENT(start_time);
+		Assert(timeout >= 0 && timeout <= INT_MAX);
+		cur_timeout = timeout;
+	}
+	else
+		cur_timeout = INFINITE;
+
+	/*
+	 * Construct an array of event handles for WaitforMultipleObjects().
+	 *
+	 * Note: pgwin32_signal_event should be first to ensure that it will be
+	 * reported when multiple events are set.  We want to guarantee that
+	 * pending signals are serviced.
+	 */
+	latchevent = latch->event;
+
+	events[0] = pgwin32_signal_event;
+	events[1] = latchevent;
+	numevents = 2;
+	if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+	{
+		/* Need an event object to represent events on the socket */
+		int			flags = FD_CLOSE;	/* always check for errors/EOF */
+
+		if (wakeEvents & WL_SOCKET_READABLE)
+			flags |= FD_READ;
+		if (wakeEvents & WL_SOCKET_WRITEABLE)
+			flags |= FD_WRITE;
+
+		sockevent = WSACreateEvent();
+		if (sockevent == WSA_INVALID_EVENT)
+			elog(ERROR, "failed to create event for socket: error code %u",
+				 WSAGetLastError());
+		if (WSAEventSelect(sock, sockevent, flags) != 0)
+			elog(ERROR, "failed to set up event for socket: error code %u",
+				 WSAGetLastError());
+
+		events[numevents++] = sockevent;
+	}
+	if (wakeEvents & WL_POSTMASTER_DEATH)
+	{
+		pmdeath_eventno = numevents;
+		events[numevents++] = PostmasterHandle;
+	}
+
+	/* Ensure that signals are serviced even if latch is already set */
+	pgwin32_dispatch_queued_signals();
+
+	do
+	{
+		/*
+		 * Reset the event, and check if the latch is set already. If someone
+		 * sets the latch between this and the WaitForMultipleObjects() call
+		 * below, the setter will set the event and WaitForMultipleObjects()
+		 * will return immediately.
+		 */
+		if (!ResetEvent(latchevent))
+			elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());
+
+		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
+		{
+			result |= WL_LATCH_SET;
+
+			/*
+			 * Leave loop immediately, avoid blocking again. We don't attempt
+			 * to report any other events that might also be satisfied.
+			 */
+			break;
+		}
+
+		rc = WaitForMultipleObjects(numevents, events, FALSE, cur_timeout);
+
+		if (rc == WAIT_FAILED)
+			elog(ERROR, "WaitForMultipleObjects() failed: error code %lu",
+				 GetLastError());
+		else if (rc == WAIT_TIMEOUT)
+		{
+			result |= WL_TIMEOUT;
+		}
+		else if (rc == WAIT_OBJECT_0)
+		{
+			/* Service newly-arrived signals */
+			pgwin32_dispatch_queued_signals();
+		}
+		else if (rc == WAIT_OBJECT_0 + 1)
+		{
+			/*
+			 * Latch is set.  We'll handle that on next iteration of loop, but
+			 * let's not waste the cycles to update cur_timeout below.
+			 */
+			continue;
+		}
+		else if ((wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) &&
+				 rc == WAIT_OBJECT_0 + 2)		/* socket is at event slot 2 */
+		{
+			WSANETWORKEVENTS resEvents;
+
+			ZeroMemory(&resEvents, sizeof(resEvents));
+			if (WSAEnumNetworkEvents(sock, sockevent, &resEvents) != 0)
+				elog(ERROR, "failed to enumerate network events: error code %u",
+					 WSAGetLastError());
+			if ((wakeEvents & WL_SOCKET_READABLE) &&
+				(resEvents.lNetworkEvents & FD_READ))
+			{
+				result |= WL_SOCKET_READABLE;
+			}
+			if ((wakeEvents & WL_SOCKET_WRITEABLE) &&
+				(resEvents.lNetworkEvents & FD_WRITE))
+			{
+				result |= WL_SOCKET_WRITEABLE;
+			}
+			if (resEvents.lNetworkEvents & FD_CLOSE)
+			{
+				if (wakeEvents & WL_SOCKET_READABLE)
+					result |= WL_SOCKET_READABLE;
+				if (wakeEvents & WL_SOCKET_WRITEABLE)
+					result |= WL_SOCKET_WRITEABLE;
+			}
+		}
+		else if ((wakeEvents & WL_POSTMASTER_DEATH) &&
+				 rc == WAIT_OBJECT_0 + pmdeath_eventno)
+		{
+			/*
+			 * Postmaster apparently died.  Since the consequences of falsely
+			 * returning WL_POSTMASTER_DEATH could be pretty unpleasant, we
+			 * take the trouble to positively verify this with
+			 * PostmasterIsAlive(), even though there is no known reason to
+			 * think that the event could be falsely set on Windows.
+			 */
+			if (!PostmasterIsAlive())
+				result |= WL_POSTMASTER_DEATH;
+		}
+		else
+			elog(ERROR, "unexpected return code from WaitForMultipleObjects(): %lu", rc);
+
+		/* If we're not done, update cur_timeout for next iteration */
+		if (result == 0 && (wakeEvents & WL_TIMEOUT))
+		{
+			INSTR_TIME_SET_CURRENT(cur_time);
+			INSTR_TIME_SUBTRACT(cur_time, start_time);
+			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+			if (cur_timeout <= 0)
+			{
+				/* Timeout has expired, no need to continue looping */
+				result |= WL_TIMEOUT;
+			}
+		}
+	} while (result == 0);
+
+	/* Clean up the event object we created for the socket */
+	if (sockevent != WSA_INVALID_EVENT)
+	{
+		WSAEventSelect(sock, NULL, 0);
+		WSACloseEvent(sockevent);
+	}
+
+	return result;
+}
+#endif
 
 /*
  * Sets a latch and wakes up anyone waiting on it.
@@ -567,7 +794,11 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 void
 SetLatch(volatile Latch *latch)
 {
+#ifndef WIN32
 	pid_t		owner_pid;
+#else
+	HANDLE		handle;
+#endif
 
 	/*
 	 * The memory barrier has be to be placed here to ensure that any flag
@@ -582,6 +813,8 @@ SetLatch(volatile Latch *latch)
 
 	latch->is_set = true;
 
+#ifndef WIN32
+
 	/*
 	 * See if anyone's waiting for the latch. It can be the current process if
 	 * we're in a signal handler. We use the self-pipe to wake up the select()
@@ -613,6 +846,27 @@ SetLatch(volatile Latch *latch)
 	}
 	else
 		kill(owner_pid, SIGUSR1);
+#else
+
+	/*
+	 * See if anyone's waiting for the latch. It can be the current process if
+	 * we're in a signal handler.
+	 *
+	 * Use a local variable here just in case somebody changes the event field
+	 * concurrently (which really should not happen).
+	 */
+	handle = latch->event;
+	if (handle)
+	{
+		SetEvent(handle);
+
+		/*
+		 * Note that we silently ignore any errors. We might be in a signal
+		 * handler or other critical path where it's not safe to call elog().
+		 */
+	}
+#endif
+
 }
 
 /*
@@ -646,14 +900,17 @@ ResetLatch(volatile Latch *latch)
  * NB: when calling this in a signal handler, be sure to save and restore
  * errno around it.
  */
+#ifndef WIN32
 void
 latch_sigusr1_handler(void)
 {
 	if (waiting)
 		sendSelfPipeByte();
 }
+#endif   /* !WIN32 */
 
 /* Send one byte to the self-pipe, to wake up WaitLatch */
+#ifndef WIN32
 static void
 sendSelfPipeByte(void)
 {
@@ -683,6 +940,7 @@ retry:
 		return;
 	}
 }
+#endif   /* !WIN32 */
 
 /*
  * Read all available data from the self-pipe
@@ -691,6 +949,7 @@ retry:
  * return, it must reset that flag first (though ideally, this will never
  * happen).
  */
+#ifndef WIN32
 static void
 drainSelfPipe(void)
 {
@@ -729,3 +988,4 @@ drainSelfPipe(void)
 		/* else buffer wasn't big enough, so read again */
 	}
 }
+#endif   /* !WIN32 */
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index e77491e..2719498 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -36,7 +36,7 @@
  * WaitLatch includes a provision for timeouts (which should be avoided
  * when possible, as they incur extra overhead) and a provision for
  * postmaster child processes to wake up immediately on postmaster death.
- * See unix_latch.c for detailed specifications for the exported functions.
+ * See latch.c for detailed specifications for the exported functions.
  *
  * The correct pattern to wait for event(s) is:
  *
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 949077a..b6e4577 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -134,8 +134,6 @@ sub mkvcbuild
 		'src/backend/port/win32_sema.c');
 	$postgres->ReplaceFile('src/backend/port/pg_shmem.c',
 		'src/backend/port/win32_shmem.c');
-	$postgres->ReplaceFile('src/backend/port/pg_latch.c',
-		'src/backend/port/win32_latch.c');
 	$postgres->AddFiles('src/port',   @pgportfiles);
 	$postgres->AddFiles('src/common', @pgcommonbkndfiles);
 	$postgres->AddDir('src/timezone');
-- 
2.7.0.229.g701fa7f
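
Patch 0005 below fixes the reported regression by waiting through epoll
where available. The property it relies on is that an epoll fd keeps its
registrations inside the kernel, so the per-wait work is just epoll_wait();
a bare-bones illustration of the underlying kernel API (not the patch's
wrapper; error handling elided):

    #include <sys/epoll.h>

    /* sock is assumed to be an already-connected, nonblocking socket */
    static void
    wait_on_socket(int sock)
    {
        int         epfd = epoll_create1(0);
        struct epoll_event ev;

        ev.events = EPOLLIN;
        ev.data.fd = sock;
        epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);      /* register once */

        for (;;)
        {
            struct epoll_event occurred;

            (void) epoll_wait(epfd, &occurred, 1, -1);  /* cheap per wait */
            /* ... handle readiness, e.g. read from sock ... */
        }
    }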

0005-WIP-Introduce-new-WaitEventSet-API.patch (text/x-patch; charset=us-ascii)
From 35d645265abd0ff6d03ef246bc30bf3edc268439 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 18 Mar 2016 00:54:41 -0700
Subject: [PATCH 5/5] WIP: Introduce new WaitEventSet API.

Commit ac1d794 ("Make idle backends exit if the postmaster dies.")
introduced a regression on, at least, large linux systems. Constantly
adding the same postmaster_alive_fds to the OS's internal data structures
for implementing poll/select can cause significant contention, leading
to a performance regression of nearly 3x in one example.

This can be avoided by using e.g. linux' epoll, which avoids having to
add/remove file descriptors to the wait data structures at a high rate.
Unfortunately the current latch interface makes it hard to allocate any
persistent per-backend resources.

Replace, with a backward compatibility layer, WaitLatchOrSocket with a
new WaitEventSet API. Users can allocate such a Set across multiple
calls, and add more than one file descriptor to wait on. The latter has
been added because there are upcoming postgres features where that will
be helpful.

In addition to the previously existing poll(2), select(2), and
WaitForMultipleObjects() implementations, also provide an epoll_wait(2)
based implementation to address the aforementioned performance
problem. Epoll is only available on linux, but that is the most likely
OS for machines large enough (four sockets) to reproduce the problem.

Todo:
* Testing, especially windows
* Documentation

Reported-By: Dmitry Vasilyev
Discussion: CAB-SwXZh44_2ybvS5Z67p_CDz=XFn4hNAD=CnMEF+QqkXwFrGg@mail.gmail.com
    20160114143931.GG10941@awork2.anarazel.de
---
 configure                         |    2 +-
 configure.in                      |    2 +-
 src/backend/libpq/be-secure.c     |   24 +-
 src/backend/libpq/pqcomm.c        |    4 +
 src/backend/storage/ipc/latch.c   | 1540 +++++++++++++++++++++++++------------
 src/backend/utils/init/miscinit.c |    8 +
 src/include/libpq/libpq.h         |    3 +
 src/include/pg_config.h.in        |    3 +
 src/include/storage/latch.h       |   14 +
 src/tools/pgindent/typedefs.list  |    2 +
 10 files changed, 1084 insertions(+), 518 deletions(-)

diff --git a/configure b/configure
index c10d954..24655dc 100755
--- a/configure
+++ b/configure
@@ -10193,7 +10193,7 @@ fi
 ## Header files
 ##
 
-for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
+for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/epoll.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
 do :
   as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh`
 ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default"
diff --git a/configure.in b/configure.in
index 47d0f58..c564a76 100644
--- a/configure.in
+++ b/configure.in
@@ -1183,7 +1183,7 @@ AC_SUBST(UUID_LIBS)
 ##
 
 dnl sys/socket.h is required by AC_FUNC_ACCEPT_ARGTYPES
-AC_CHECK_HEADERS([atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h])
+AC_CHECK_HEADERS([atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/epoll.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h])
 
 # On BSD, test for net/if.h will fail unless sys/socket.h
 # is included first.
diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c
index ac709d1..c396811 100644
--- a/src/backend/libpq/be-secure.c
+++ b/src/backend/libpq/be-secure.c
@@ -140,13 +140,13 @@ retry:
 	/* In blocking mode, wait until the socket is ready */
 	if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))
 	{
-		int			w;
+		WaitEvent   event;
 
 		Assert(waitfor);
 
-		w = WaitLatchOrSocket(MyLatch,
-							  WL_LATCH_SET | WL_POSTMASTER_DEATH | waitfor,
-							  port->sock, 0);
+		ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);
+
+		WaitEventSetWait(FeBeWaitSet, 0 /* no timeout */, &event, 1);
 
 		/*
 		 * If the postmaster has died, it's not safe to continue running,
@@ -165,13 +165,13 @@ retry:
 		 * cycles checking for this very rare condition, and this should cause
 		 * us to exit quickly in most cases.)
 		 */
-		if (w & WL_POSTMASTER_DEATH)
+		if (event.events & WL_POSTMASTER_DEATH)
 			ereport(FATAL,
 					(errcode(ERRCODE_ADMIN_SHUTDOWN),
 					errmsg("terminating connection due to unexpected postmaster exit")));
 
 		/* Handle interrupt. */
-		if (w & WL_LATCH_SET)
+		if (event.events & WL_LATCH_SET)
 		{
 			ResetLatch(MyLatch);
 			ProcessClientReadInterrupt(true);
@@ -241,22 +241,22 @@ retry:
 
 	if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))
 	{
-		int			w;
+		WaitEvent   event;
 
 		Assert(waitfor);
 
-		w = WaitLatchOrSocket(MyLatch,
-							  WL_LATCH_SET | WL_POSTMASTER_DEATH | waitfor,
-							  port->sock, 0);
+		ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);
+
+		WaitEventSetWait(FeBeWaitSet, 0 /* no timeout */, &event, 1);
 
 		/* See comments in secure_read. */
-		if (w & WL_POSTMASTER_DEATH)
+		if (event.events & WL_POSTMASTER_DEATH)
 			ereport(FATAL,
 					(errcode(ERRCODE_ADMIN_SHUTDOWN),
 					errmsg("terminating connection due to unexpected postmaster exit")));
 
 		/* Handle interrupt. */
-		if (w & WL_LATCH_SET)
+		if (event.events & WL_LATCH_SET)
 		{
 			ResetLatch(MyLatch);
 			ProcessClientWriteInterrupt(true);
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 71473db..c81abaf 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,6 +201,10 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock, NULL);
+	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch);
+	AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL);
 }
 
 /* --------------------------------
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 143d2a1..0759398 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -41,6 +41,9 @@
 #include <unistd.h>
 #include <sys/time.h>
 #include <sys/types.h>
+#ifdef HAVE_SYS_EPOLL_H
+#include <sys/epoll.h>
+#endif
 #ifdef HAVE_POLL_H
 #include <poll.h>
 #endif
@@ -65,18 +68,38 @@
  * useful to manually specify the used primitive.  If desired, just add a
  * define somewhere before this block.
  */
-#if defined(LATCH_USE_POLL) || defined(LATCH_USE_SELECT) || defined(LATCH_USE_WIN32)
+#if defined(WAIT_USE_EPOLL) || defined(WAIT_USE_POLL) || defined(WAIT_USE_SELECT) || defined(WAIT_USE_WIN32)
 /* don't overwrite manual choice */
+#elif defined(HAVE_SYS_EPOLL_H)
+#define WAIT_USE_EPOLL
 #elif defined(HAVE_POLL)
-#define LATCH_USE_POLL
+#define WAIT_USE_POLL
 #elif HAVE_SYS_SELECT_H
-#define LATCH_USE_SELECT
+#define WAIT_USE_SELECT
 #elif WIN32
-#define LATCH_USE_WIN32
+#define WAIT_USE_WIN32
 #else
-#error "no latch implementation available"
+#error "no wait set implementation available"
 #endif
 
+typedef struct WaitEventSet
+{
+	int			nevents;
+	int			nevents_space;
+	Latch	   *latch;
+	int			latch_pos;
+	WaitEvent  *events;
+#if defined(WAIT_USE_EPOLL)
+	struct epoll_event *epoll_ret_events;
+	int			epoll_fd;
+#elif defined(WAIT_USE_POLL)
+	struct pollfd *pollfds;
+#endif
+#if defined(WAIT_USE_WIN32)
+	HANDLE	   *handles;
+#endif
+} WaitEventSet;
+
 #ifndef WIN32
 /* Are we currently in WaitLatch? The signal handler would like to know. */
 static volatile sig_atomic_t waiting = false;
@@ -90,6 +113,16 @@ static void sendSelfPipeByte(void);
 static void drainSelfPipe(void);
 #endif   /* WIN32 */
 
+#if defined(WAIT_USE_EPOLL)
+static void WaitEventAdjustEpoll(WaitEventSet *set, WaitEvent *event, int action);
+#elif defined(WAIT_USE_POLL)
+static void WaitEventAdjustPoll(WaitEventSet *set, WaitEvent *event);
+#elif defined(WAIT_USE_WIN32)
+static void WaitEventAdjustWin32(WaitEventSet *set, WaitEvent *event);
+#endif
+
+static int WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+								 WaitEvent *occurred_events, int nevents);
 
 /*
  * Initialize the process-local latch infrastructure.
@@ -255,529 +288,56 @@ WaitLatch(volatile Latch *latch, int wakeEvents, long timeout)
  * returning the socket as readable/writable or both, depending on
  * WL_SOCKET_READABLE/WL_SOCKET_WRITEABLE being specified.
  */
-#ifndef WIN32
 int
 WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 				  long timeout)
 {
-	int			result = 0;
+	int			ret = 0;
 	int			rc;
-	instr_time	start_time,
-				cur_time;
-	long		cur_timeout;
+	WaitEvent	event;
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
 
-#if defined(LATCH_USE_POLL)
-	struct pollfd pfds[3];
-	int			nfds;
-#elif defined(LATCH_USE_SELECT)
-	struct timeval tv,
-			   *tvp;
-	fd_set		input_mask;
-	fd_set		output_mask;
-	int			hifd;
-#endif
-
-	Assert(wakeEvents != 0);	/* must have at least one wake event */
+	if (wakeEvents & WL_TIMEOUT)
+		Assert(timeout >= 0);
+	else
+		timeout = -1;
 
 	/* waiting for socket readiness without a socket indicates a bug */
 	if (sock == PGINVALID_SOCKET &&
 		(wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
 		elog(ERROR, "cannot wait on socket event without a socket");
 
-	if ((wakeEvents & WL_LATCH_SET) && latch->owner_pid != MyProcPid)
-		elog(ERROR, "cannot wait on a latch owned by another process");
+	if (wakeEvents & WL_LATCH_SET)
+		AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET,
+						  (Latch *) latch);
 
-	/*
-	 * Initialize timeout if requested.  We must record the current time so
-	 * that we can determine the remaining timeout if the poll() or select()
-	 * is interrupted.  (On some platforms, select() will update the contents
-	 * of "tv" for us, but unfortunately we can't rely on that.)
-	 */
-	if (wakeEvents & WL_TIMEOUT)
-	{
-		INSTR_TIME_SET_CURRENT(start_time);
-		Assert(timeout >= 0 && timeout <= INT_MAX);
-		cur_timeout = timeout;
+	if (wakeEvents & WL_POSTMASTER_DEATH)
+		AddWaitEventToSet(set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL);
 
-#ifdef LATCH_USE_SELECT
-		tv.tv_sec = cur_timeout / 1000L;
-		tv.tv_usec = (cur_timeout % 1000L) * 1000L;
-		tvp = &tv;
-#endif
-	}
-	else
-	{
-		cur_timeout = -1;
-
-#ifdef LATCH_USE_SELECT
-		tvp = NULL;
-#endif
-	}
-
-	waiting = true;
-	do
-	{
-		/*
-		 * Check if the latch is set already. If so, leave loop immediately,
-		 * avoid blocking again. We don't attempt to report any other events
-		 * that might also be satisfied.
-		 *
-		 * If someone sets the latch between this and the poll()/select()
-		 * below, the setter will write a byte to the pipe (or signal us and
-		 * the signal handler will do that), and the poll()/select() will
-		 * return immediately.
-		 *
-		 * If there's a pending byte in the self pipe, we'll notice whenever
-		 * blocking. Only clearing the pipe in that case avoids having to
-		 * drain it every time WaitLatchOrSocket() is used. Should the
-		 * pipe-buffer fill up we're still ok, because the pipe is in
-		 * nonblocking mode. It's unlikely for that to happen, because the
-		 * self pipe isn't filled unless we're blocking (waiting = true), or
-		 * from inside a signal handler in latch_sigusr1_handler().
-		 *
-		 * Note: we assume that the kernel calls involved in drainSelfPipe()
-		 * and SetLatch() will provide adequate synchronization on machines
-		 * with weak memory ordering, so that we cannot miss seeing is_set if
-		 * the signal byte is already in the pipe when we drain it.
-		 */
-		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
-		{
-			result |= WL_LATCH_SET;
-			break;
-		}
-
-		/*
-		 * Must wait ... we use the polling interface determined at the top of
-		 * this file to do so.
-		 */
-#if defined(LATCH_USE_POLL)
-		nfds = 0;
-
-		/* selfpipe is always in pfds[0] */
-		pfds[0].fd = selfpipe_readfd;
-		pfds[0].events = POLLIN;
-		pfds[0].revents = 0;
-		nfds++;
-
-		if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
-		{
-			/* socket, if used, is always in pfds[1] */
-			pfds[1].fd = sock;
-			pfds[1].events = 0;
-			if (wakeEvents & WL_SOCKET_READABLE)
-				pfds[1].events |= POLLIN;
-			if (wakeEvents & WL_SOCKET_WRITEABLE)
-				pfds[1].events |= POLLOUT;
-			pfds[1].revents = 0;
-			nfds++;
-		}
-
-		if (wakeEvents & WL_POSTMASTER_DEATH)
-		{
-			/* postmaster fd, if used, is always in pfds[nfds - 1] */
-			pfds[nfds].fd = postmaster_alive_fds[POSTMASTER_FD_WATCH];
-			pfds[nfds].events = POLLIN;
-			pfds[nfds].revents = 0;
-			nfds++;
-		}
-
-		/* Sleep */
-		rc = poll(pfds, nfds, (int) cur_timeout);
-
-		/* Check return code */
-		if (rc < 0)
-		{
-			/* EINTR is okay, otherwise complain */
-			if (errno != EINTR)
-			{
-				waiting = false;
-				ereport(ERROR,
-						(errcode_for_socket_access(),
-						 errmsg("poll() failed: %m")));
-			}
-		}
-		else if (rc == 0)
-		{
-			/* timeout exceeded */
-			if (wakeEvents & WL_TIMEOUT)
-				result |= WL_TIMEOUT;
-		}
-		else
-		{
-			/* at least one event occurred, so check revents values */
-
-			if (pfds[0].revents & POLLIN)
-			{
-				/* There's data in the self-pipe, clear it. */
-				drainSelfPipe();
-			}
-
-			if ((wakeEvents & WL_SOCKET_READABLE) &&
-				(pfds[1].revents & POLLIN))
-			{
-				/* data available in socket, or EOF/error condition */
-				result |= WL_SOCKET_READABLE;
-			}
-			if ((wakeEvents & WL_SOCKET_WRITEABLE) &&
-				(pfds[1].revents & POLLOUT))
-			{
-				/* socket is writable */
-				result |= WL_SOCKET_WRITEABLE;
-			}
-			if ((wakeEvents & WL_SOCKET_WRITEABLE) &&
-				(pfds[1].revents & (POLLHUP | POLLERR | POLLNVAL)))
-			{
-				/* EOF/error condition */
-				if (wakeEvents & WL_SOCKET_READABLE)
-					result |= WL_SOCKET_READABLE;
-				if (wakeEvents & WL_SOCKET_WRITEABLE)
-					result |= WL_SOCKET_WRITEABLE;
-			}
-
-			/*
-			 * We expect a POLLHUP when the remote end is closed, but because
-			 * we don't expect the pipe to become readable or to have any
-			 * errors either, treat those cases as postmaster death, too.
-			 */
-			if ((wakeEvents & WL_POSTMASTER_DEATH) &&
-				(pfds[nfds - 1].revents & (POLLHUP | POLLIN | POLLERR | POLLNVAL)))
-			{
-				/*
-				 * According to the select(2) man page on Linux, select(2) may
-				 * spuriously return and report a file descriptor as readable,
-				 * when it's not; and presumably so can poll(2).  It's not
-				 * clear that the relevant cases would ever apply to the
-				 * postmaster pipe, but since the consequences of falsely
-				 * returning WL_POSTMASTER_DEATH could be pretty unpleasant,
-				 * we take the trouble to positively verify EOF with
-				 * PostmasterIsAlive().
-				 */
-				if (!PostmasterIsAlive())
-					result |= WL_POSTMASTER_DEATH;
-			}
-		}
-#elif defined(LATCH_USE_SELECT)
-
-		/*
-		 * On at least older linux kernels select(), in violation of POSIX,
-		 * doesn't reliably return a socket as writable if closed - but we
-		 * rely on that. So far all the known cases of this problem are on
-		 * platforms that also provide a poll() implementation without that
-		 * bug.  If we find one where that's not the case, we'll need to add a
-		 * workaround.
-		 */
-		FD_ZERO(&input_mask);
-		FD_ZERO(&output_mask);
-
-		FD_SET(selfpipe_readfd, &input_mask);
-		hifd = selfpipe_readfd;
-
-		if (wakeEvents & WL_POSTMASTER_DEATH)
-		{
-			FD_SET(postmaster_alive_fds[POSTMASTER_FD_WATCH], &input_mask);
-			if (postmaster_alive_fds[POSTMASTER_FD_WATCH] > hifd)
-				hifd = postmaster_alive_fds[POSTMASTER_FD_WATCH];
-		}
-
-		if (wakeEvents & WL_SOCKET_READABLE)
-		{
-			FD_SET(sock, &input_mask);
-			if (sock > hifd)
-				hifd = sock;
-		}
-
-		if (wakeEvents & WL_SOCKET_WRITEABLE)
-		{
-			FD_SET(sock, &output_mask);
-			if (sock > hifd)
-				hifd = sock;
-		}
-
-		/* Sleep */
-		rc = select(hifd + 1, &input_mask, &output_mask, NULL, tvp);
-
-		/* Check return code */
-		if (rc < 0)
-		{
-			/* EINTR is okay, otherwise complain */
-			if (errno != EINTR)
-			{
-				waiting = false;
-				ereport(ERROR,
-						(errcode_for_socket_access(),
-						 errmsg("select() failed: %m")));
-			}
-		}
-		else if (rc == 0)
-		{
-			/* timeout exceeded */
-			if (wakeEvents & WL_TIMEOUT)
-				result |= WL_TIMEOUT;
-		}
-		else
-		{
-			/* at least one event occurred, so check masks */
-			if (FD_ISSET(selfpipe_readfd, &input_mask))
-			{
-				/* There's data in the self-pipe, clear it. */
-				drainSelfPipe();
-			}
-			if ((wakeEvents & WL_SOCKET_READABLE) && FD_ISSET(sock, &input_mask))
-			{
-				/* data available in socket, or EOF */
-				result |= WL_SOCKET_READABLE;
-			}
-			if ((wakeEvents & WL_SOCKET_WRITEABLE) && FD_ISSET(sock, &output_mask))
-			{
-				/* socket is writable, or EOF */
-				result |= WL_SOCKET_WRITEABLE;
-			}
-			if ((wakeEvents & WL_POSTMASTER_DEATH) &&
-				FD_ISSET(postmaster_alive_fds[POSTMASTER_FD_WATCH],
-						 &input_mask))
-			{
-				/*
-				 * According to the select(2) man page on Linux, select(2) may
-				 * spuriously return and report a file descriptor as readable,
-				 * when it's not; and presumably so can poll(2).  It's not
-				 * clear that the relevant cases would ever apply to the
-				 * postmaster pipe, but since the consequences of falsely
-				 * returning WL_POSTMASTER_DEATH could be pretty unpleasant,
-				 * we take the trouble to positively verify EOF with
-				 * PostmasterIsAlive().
-				 */
-				if (!PostmasterIsAlive())
-					result |= WL_POSTMASTER_DEATH;
-			}
-		}
-#endif   /* LATCH_USE_SELECT */
-
-		/*
-		 * Check again whether latch is set, the arrival of a signal/self-byte
-		 * might be what stopped our sleep. It's not required for correctness
-		 * to signal the latch as being set (we'd just loop if there's no
-		 * other event), but it seems good to report an arrived latch asap.
-		 * This way we also don't have to compute the current timestamp again.
-		 */
-		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
-			result |= WL_LATCH_SET;
-
-		/* If we're not done, update cur_timeout for next iteration */
-		if (result == 0 && (wakeEvents & WL_TIMEOUT))
-		{
-			INSTR_TIME_SET_CURRENT(cur_time);
-			INSTR_TIME_SUBTRACT(cur_time, start_time);
-			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
-			if (cur_timeout <= 0)
-			{
-				/* Timeout has expired, no need to continue looping */
-				result |= WL_TIMEOUT;
-			}
-#ifdef LATCH_USE_SELECT
-			else
-			{
-				tv.tv_sec = cur_timeout / 1000L;
-				tv.tv_usec = (cur_timeout % 1000L) * 1000L;
-			}
-#endif
-		}
-	} while (result == 0);
-	waiting = false;
-
-	return result;
-}
-#else
-int
-WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
-				  long timeout)
-{
-	DWORD		rc;
-	instr_time	start_time,
-				cur_time;
-	long		cur_timeout;
-	HANDLE		events[4];
-	HANDLE		latchevent;
-	HANDLE		sockevent = WSA_INVALID_EVENT;
-	int			numevents;
-	int			result = 0;
-	int			pmdeath_eventno = 0;
-
-	Assert(wakeEvents != 0);	/* must have at least one wake event */
-
-	/* waiting for socket readiness without a socket indicates a bug */
-	if (sock == PGINVALID_SOCKET &&
-		(wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
-		elog(ERROR, "cannot wait on socket events without a socket");
-
-	if ((wakeEvents & WL_LATCH_SET) && latch->owner_pid != MyProcPid)
-		elog(ERROR, "cannot wait on a latch owned by another process");
-
-	/*
-	 * Initialize timeout if requested.  We must record the current time so
-	 * that we can determine the remaining timeout if WaitForMultipleObjects
-	 * is interrupted.
-	 */
-	if (wakeEvents & WL_TIMEOUT)
-	{
-		INSTR_TIME_SET_CURRENT(start_time);
-		Assert(timeout >= 0 && timeout <= INT_MAX);
-		cur_timeout = timeout;
-	}
-	else
-		cur_timeout = INFINITE;
-
-	/*
-	 * Construct an array of event handles for WaitforMultipleObjects().
-	 *
-	 * Note: pgwin32_signal_event should be first to ensure that it will be
-	 * reported when multiple events are set.  We want to guarantee that
-	 * pending signals are serviced.
-	 */
-	latchevent = latch->event;
-
-	events[0] = pgwin32_signal_event;
-	events[1] = latchevent;
-	numevents = 2;
 	if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
 	{
-		/* Need an event object to represent events on the socket */
-		int			flags = FD_CLOSE;	/* always check for errors/EOF */
+		int			ev;
 
-		if (wakeEvents & WL_SOCKET_READABLE)
-			flags |= FD_READ;
-		if (wakeEvents & WL_SOCKET_WRITEABLE)
-			flags |= FD_WRITE;
-
-		sockevent = WSACreateEvent();
-		if (sockevent == WSA_INVALID_EVENT)
-			elog(ERROR, "failed to create event for socket: error code %u",
-				 WSAGetLastError());
-		if (WSAEventSelect(sock, sockevent, flags) != 0)
-			elog(ERROR, "failed to set up event for socket: error code %u",
-				 WSAGetLastError());
-
-		events[numevents++] = sockevent;
-	}
-	if (wakeEvents & WL_POSTMASTER_DEATH)
-	{
-		pmdeath_eventno = numevents;
-		events[numevents++] = PostmasterHandle;
+		ev = wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);
+		AddWaitEventToSet(set, ev, sock, NULL);
 	}
 
-	/* Ensure that signals are serviced even if latch is already set */
-	pgwin32_dispatch_queued_signals();
+	rc = WaitEventSetWait(set, timeout, &event, 1);
 
-	do
+	if (rc == 0)
+		ret |= WL_TIMEOUT;
+	else
 	{
-		/*
-		 * Reset the event, and check if the latch is set already. If someone
-		 * sets the latch between this and the WaitForMultipleObjects() call
-		 * below, the setter will set the event and WaitForMultipleObjects()
-		 * will return immediately.
-		 */
-		if (!ResetEvent(latchevent))
-			elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());
-
-		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
-		{
-			result |= WL_LATCH_SET;
-
-			/*
-			 * Leave loop immediately, avoid blocking again. We don't attempt
-			 * to report any other events that might also be satisfied.
-			 */
-			break;
-		}
-
-		rc = WaitForMultipleObjects(numevents, events, FALSE, cur_timeout);
-
-		if (rc == WAIT_FAILED)
-			elog(ERROR, "WaitForMultipleObjects() failed: error code %lu",
-				 GetLastError());
-		else if (rc == WAIT_TIMEOUT)
-		{
-			result |= WL_TIMEOUT;
-		}
-		else if (rc == WAIT_OBJECT_0)
-		{
-			/* Service newly-arrived signals */
-			pgwin32_dispatch_queued_signals();
-		}
-		else if (rc == WAIT_OBJECT_0 + 1)
-		{
-			/*
-			 * Latch is set.  We'll handle that on next iteration of loop, but
-			 * let's not waste the cycles to update cur_timeout below.
-			 */
-			continue;
-		}
-		else if ((wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) &&
-				 rc == WAIT_OBJECT_0 + 2)		/* socket is at event slot 2 */
-		{
-			WSANETWORKEVENTS resEvents;
-
-			ZeroMemory(&resEvents, sizeof(resEvents));
-			if (WSAEnumNetworkEvents(sock, sockevent, &resEvents) != 0)
-				elog(ERROR, "failed to enumerate network events: error code %u",
-					 WSAGetLastError());
-			if ((wakeEvents & WL_SOCKET_READABLE) &&
-				(resEvents.lNetworkEvents & FD_READ))
-			{
-				result |= WL_SOCKET_READABLE;
-			}
-			if ((wakeEvents & WL_SOCKET_WRITEABLE) &&
-				(resEvents.lNetworkEvents & FD_WRITE))
-			{
-				result |= WL_SOCKET_WRITEABLE;
-			}
-			if (resEvents.lNetworkEvents & FD_CLOSE)
-			{
-				if (wakeEvents & WL_SOCKET_READABLE)
-					result |= WL_SOCKET_READABLE;
-				if (wakeEvents & WL_SOCKET_WRITEABLE)
-					result |= WL_SOCKET_WRITEABLE;
-			}
-		}
-		else if ((wakeEvents & WL_POSTMASTER_DEATH) &&
-				 rc == WAIT_OBJECT_0 + pmdeath_eventno)
-		{
-			/*
-			 * Postmaster apparently died.  Since the consequences of falsely
-			 * returning WL_POSTMASTER_DEATH could be pretty unpleasant, we
-			 * take the trouble to positively verify this with
-			 * PostmasterIsAlive(), even though there is no known reason to
-			 * think that the event could be falsely set on Windows.
-			 */
-			if (!PostmasterIsAlive())
-				result |= WL_POSTMASTER_DEATH;
-		}
-		else
-			elog(ERROR, "unexpected return code from WaitForMultipleObjects(): %lu", rc);
-
-		/* If we're not done, update cur_timeout for next iteration */
-		if (result == 0 && (wakeEvents & WL_TIMEOUT))
-		{
-			INSTR_TIME_SET_CURRENT(cur_time);
-			INSTR_TIME_SUBTRACT(cur_time, start_time);
-			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
-			if (cur_timeout <= 0)
-			{
-				/* Timeout has expired, no need to continue looping */
-				result |= WL_TIMEOUT;
-			}
-		}
-	} while (result == 0);
-
-	/* Clean up the event object we created for the socket */
-	if (sockevent != WSA_INVALID_EVENT)
-	{
-		WSAEventSelect(sock, NULL, 0);
-		WSACloseEvent(sockevent);
+		ret |= event.events & (WL_LATCH_SET |
+							   WL_POSTMASTER_DEATH |
+							   WL_SOCKET_READABLE |
+							   WL_SOCKET_WRITEABLE);
 	}
 
-	return result;
+	FreeWaitEventSet(set);
+
+	return ret;
 }
-#endif
 
 /*
  * Sets a latch and wakes up anyone waiting on it.
@@ -891,6 +451,978 @@ ResetLatch(volatile Latch *latch)
 }
 
 /*
+ * Create a WaitEventSet with space for nevents different events to wait for.
+ *
+ * latch may be NULL.
+ */
+WaitEventSet *
+CreateWaitEventSet(MemoryContext context, int nevents)
+{
+	WaitEventSet *set;
+	char	   *data;
+	Size		sz = 0;
+
+	sz += sizeof(WaitEventSet);
+	sz += sizeof(WaitEvent) * nevents;
+
+#if defined(WAIT_USE_EPOLL)
+	sz += sizeof(struct epoll_event) * nevents;
+#elif defined(WAIT_USE_POLL)
+	sz += sizeof(struct pollfd) * nevents;
+#elif defined(WAIT_USE_WIN32)
+	/* need space for the pgwin32_signal_event */
+	sz += sizeof(HANDLE) * (nevents + 1);
+#endif
+
+	data = (char *) MemoryContextAllocZero(context, sz);
+
+	set = (WaitEventSet *) data;
+	data += sizeof(WaitEventSet);
+
+	set->events = (WaitEvent *) data;
+	data += sizeof(WaitEvent) * nevents;
+
+#if defined(WAIT_USE_EPOLL)
+	set->epoll_ret_events = (struct epoll_event *) data;
+	data += sizeof(struct epoll_event) * nevents;
+#elif defined(WAIT_USE_POLL)
+	set->pollfds = (struct pollfd *) data;
+	data += sizeof(struct pollfd) * nevents;
+#elif defined(WAIT_USE_WIN32)
+	set->handles = (HANDLE *) data;
+	data += sizeof(HANDLE) * nevents;
+#endif
+
+	set->latch = NULL;
+	set->nevents_space = nevents;
+
+#if defined(WAIT_USE_EPOLL)
+	set->epoll_fd = epoll_create(nevents);
+	if (set->epoll_fd < 0)
+		elog(ERROR, "epoll_create failed: %m");
+#elif defined(WAIT_USE_WIN32)
+
+	/*
+	 * To handle signals while waiting, we need to add a win32 specific event.
+	 * We accounted for the additional event at the top of this routine. See
+	 * port/win32/signal.c for more details.
+	 *
+	 * Note: pgwin32_signal_event should be first to ensure that it will be
+	 * reported when multiple events are set.  We want to guarantee that
+	 * pending signals are serviced.
+	 */
+	set->handles[0] = pgwin32_signal_event;
+#endif
+
+	return set;
+}
+
+/*
+ * Free a previously created WaitEventSet.
+ */
+void
+FreeWaitEventSet(WaitEventSet *set)
+{
+#if defined(WAIT_USE_EPOLL)
+	close(set->epoll_fd);
+#elif defined(WAIT_USE_WIN32)
+	WaitEvent  *cur_event;
+
+	for (cur_event = set->events;
+		 cur_event < (set->events + set->nevents);
+		 cur_event++)
+	{
+		if (cur_event->events & WL_LATCH_SET)
+		{
+			/* uses the latch's HANDLE */
+		}
+		else if (cur_event->events & WL_POSTMASTER_DEATH)
+		{
+			/* uses PostmasterHandle */
+		}
+		else
+		{
+			/* Clean up the event object we created for the socket */
+			WSAEventSelect(cur_event->fd, NULL, 0);
+			WSACloseEvent(set->handles[cur_event->pos + 1]);
+		}
+	}
+#endif
+
+	pfree(set);
+}
+
+/* ---
+ * Add an event to the set. Possible events are:
+ * - WL_LATCH_SET: Wait for the latch to be set
+ * - WL_POSTMASTER_DEATH: Wait for postmaster to die
+ * - WL_SOCKET_READABLE: Wait for socket to become readable
+ *	 can be combined in one event with WL_SOCKET_WRITEABLE
+ * - WL_SOCKET_WRITEABLE: Wait for socket to become writeable
+ *	 can be combined with WL_SOCKET_READABLE
+ *
+ * Returns the offset in WaitEventSet->events (starting from 0), which can be
+ * used to modify previously added wait events.
+ */
+int
+AddWaitEventToSet(WaitEventSet *set, uint32 events, int fd, Latch *latch)
+{
+	WaitEvent  *event;
+
+	if (set->nevents_space <= set->nevents)
+		elog(ERROR, "no space for yet another event");
+
+	if (set->latch && latch)
+		elog(ERROR, "cannot wait on more than one latch");
+
+	if (latch == NULL && (events & WL_LATCH_SET))
+		elog(ERROR, "cannot wait on latch without a specified latch");
+
+	/* waiting for socket readiness without a socket indicates a bug */
+	if (fd == PGINVALID_SOCKET &&
+		(events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)))
+		elog(ERROR, "cannot wait on socket events without a socket");
+
+	/* FIXME: further event mask validation */
+
+	event = &set->events[set->nevents];
+	event->pos = set->nevents++;
+	event->fd = fd;
+	event->events = events;
+
+	if (events == WL_LATCH_SET)
+	{
+		set->latch = latch;
+		set->latch_pos = event->pos;
+#ifndef WIN32
+		event->fd = selfpipe_readfd;
+#endif
+	}
+	else if (events == WL_POSTMASTER_DEATH)
+	{
+#ifndef WIN32
+		event->fd = postmaster_alive_fds[POSTMASTER_FD_WATCH];
+#endif
+	}
+
+#if defined(WAIT_USE_EPOLL)
+	WaitEventAdjustEpoll(set, event, EPOLL_CTL_ADD);
+#elif defined(WAIT_USE_POLL)
+	WaitEventAdjustPoll(set, event);
+#elif defined(WAIT_USE_SELECT)
+	/* nothing to do */
+#elif defined(WAIT_USE_WIN32)
+	WaitEventAdjustWin32(set, event);
+#endif
+
+	return event->pos;
+}
+
+/*
+ * Change the event mask and, if applicable, the associated latch of a
+ * WaitEvent.
+ */
+void
+ModifyWaitEvent(WaitEventSet *set, int pos, uint32 events, Latch *latch)
+{
+	WaitEvent  *event;
+
+	Assert(pos < set->nevents);
+
+	event = &set->events[pos];
+
+	/* no need to perform any checks/modifications */
+	if (events == event->events && !(event->events & WL_LATCH_SET))
+		return;
+
+	if (event->events & WL_LATCH_SET &&
+		events != event->events)
+	{
+		/* we could allow to disable latch events for a while */
+		elog(ERROR, "cannot modify latch event");
+	}
+	if (event->events & WL_POSTMASTER_DEATH)
+	{
+		elog(ERROR, "cannot modify postmaster death event");
+	}
+
+	/* FIXME: validate event mask */
+	event->events = events;
+
+	if (events == WL_LATCH_SET)
+	{
+		set->latch = latch;
+	}
+
+#if defined(WAIT_USE_EPOLL)
+	WaitEventAdjustEpoll(set, event, EPOLL_CTL_MOD);
+#elif defined(WAIT_USE_POLL)
+	WaitEventAdjustPoll(set, event);
+#elif defined(WAIT_USE_SELECT)
+	/* nothing to do */
+#elif defined(WAIT_USE_WIN32)
+	WaitEventAdjustWin32(set, event);
+#endif
+}
+
+/*
+ * Wait for events added to the set to happen, or until the timeout is
+ * reached.  At most nevents occurred events are returned.
+ *
+ * Returns the number of events that occurred, or 0 if the timeout was reached.
+ */
+int
+WaitEventSetWait(WaitEventSet *set, long timeout,
+				 WaitEvent *occurred_events, int nevents)
+{
+	int			returned_events = 0;
+	instr_time	start_time;
+	instr_time	cur_time;
+	long		cur_timeout = -1;
+
+	Assert(nevents > 0);
+
+	/*
+	 * Initialize timeout if requested.  We must record the current time so
+	 * that we can determine the remaining timeout if interrupted.
+	 */
+	if (timeout >= 0)
+	{
+		INSTR_TIME_SET_CURRENT(start_time);
+		Assert(timeout >= 0 && timeout <= INT_MAX);
+		cur_timeout = timeout;
+	}
+
+#ifndef WIN32
+	waiting = true;
+#else
+	/* Ensure that signals are serviced even if latch is already set */
+	pgwin32_dispatch_queued_signals();
+#endif
+	while (returned_events == 0)
+	{
+		int			rc;
+
+		/*
+		 * Check if the latch is set already. If so, leave the loop
+		 * immediately, avoid blocking again. We don't attempt to report any
+		 * other events that might also be satisfied.
+		 *
+		 * If someone sets the latch between this and the
+		 * WaitEventSetWaitBlock() below, the setter will write a byte to the
+		 * pipe (or signal us and the signal handler will do that), and the
+		 * readiness routine will return immediately.
+		 *
+		 * On Unix, if there's a pending byte in the self pipe, we'll notice
+		 * whenever blocking. Only clearing the pipe in that case avoids
+		 * having to drain it every time WaitLatchOrSocket() is used. Should
+		 * the pipe-buffer fill up we're still ok, because the pipe is in
+		 * nonblocking mode. It's unlikely for that to happen, because the
+		 * self pipe isn't filled unless we're blocking (waiting = true), or
+		 * from inside a signal handler in latch_sigusr1_handler().
+		 *
+		 * On windows, we'll also notice if there's a pending event for the
+		 * latch when blocking, but there's no danger of anything filling up,
+		 * as "Setting an event that is already set has no effect.".
+		 *
+		 * Note: we assume that the kernel calls involved in latch management
+		 * will provide adequate synchronization on machines with weak memory
+		 * ordering, so that we cannot miss seeing is_set if a notification
+		 * has already been queued.
+		 */
+		if (set->latch && set->latch->is_set)
+		{
+			occurred_events->fd = -1;
+			occurred_events->pos = set->latch_pos;
+			occurred_events->events = WL_LATCH_SET;
+			occurred_events++;
+			returned_events++;
+
+			break;
+		}
+
+		/*
+		 * Wait for events using the readiness primitive chosen at the top of
+		 * this file. If -1 is returned, a timeout has occurred, if 0 we have
+		 * to retry, everything >= 1 is the number of returned events.
+		 */
+		rc = WaitEventSetWaitBlock(set, cur_timeout,
+								   occurred_events, nevents);
+
+		if (rc == -1)
+			break;				/* timeout occurred */
+		else
+			returned_events = rc;
+
+		/* If we're not done, update cur_timeout for next iteration */
+		if (returned_events == 0 && timeout >= 0)
+		{
+			INSTR_TIME_SET_CURRENT(cur_time);
+			INSTR_TIME_SUBTRACT(cur_time, start_time);
+			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+			if (cur_timeout <= 0)
+				break;
+		}
+	}
+#ifndef WIN32
+	waiting = false;
+#endif
+
+	return returned_events;
+}
+
+#if defined(WAIT_USE_EPOLL)
+/*
+ * action can be one of EPOLL_CTL_ADD | EPOLL_CTL_MOD | EPOLL_CTL_DEL
+ */
+static void
+WaitEventAdjustEpoll(WaitEventSet *set, WaitEvent *event, int action)
+{
+	struct epoll_event epoll_ev;
+	int			rc;
+
+	/* pointer to our event, returned by epoll_wait */
+	epoll_ev.data.ptr = event;
+	/* always wait for errors */
+	epoll_ev.events = EPOLLERR | EPOLLHUP;
+
+	/* prepare pollfd entry once */
+	if (event->events == WL_LATCH_SET)
+	{
+		Assert(set->latch != NULL);
+		epoll_ev.events |= EPOLLIN;
+	}
+	else if (event->events == WL_POSTMASTER_DEATH)
+	{
+		epoll_ev.events |= EPOLLIN;
+	}
+	else
+	{
+		Assert(event->fd >= 0);
+		Assert(event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE));
+
+		if (event->events & WL_SOCKET_READABLE)
+			epoll_ev.events |= EPOLLIN;
+		if (event->events & WL_SOCKET_WRITEABLE)
+			epoll_ev.events |= EPOLLOUT;
+	}
+
+	/*
+	 * Even though unused, we also pass epoll_ev as the data argument if
+	 * EPOLL_CTL_DELETE is passed as action.  There used to be an epoll bug
+	 * requiring that, and actually it makes the code simpler...
+	 */
+	rc = epoll_ctl(set->epoll_fd, action, event->fd, &epoll_ev);
+
+	if (rc < 0)
+		ereport(ERROR,
+				(errcode_for_socket_access(),
+				 errmsg("epoll_ctl() failed: %m")));
+}
+#endif
+
+#if defined(WAIT_USE_POLL)
+static void
+WaitEventAdjustPoll(WaitEventSet *set, WaitEvent *event)
+{
+	struct pollfd *pollfd = &set->pollfds[event->pos];
+
+	pollfd->revents = 0;
+	pollfd->fd = event->fd;
+
+	/* prepare pollfd entry once */
+	if (event->events == WL_LATCH_SET)
+	{
+		Assert(set->latch != NULL);
+		pollfd->events = POLLIN;
+	}
+	else if (event->events == WL_POSTMASTER_DEATH)
+	{
+		pollfd->events = POLLIN;
+	}
+	else
+	{
+		Assert(event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE));
+		pollfd->events = 0;
+		if (event->events & WL_SOCKET_READABLE)
+			pollfd->events |= POLLIN;
+		if (event->events & WL_SOCKET_WRITEABLE)
+			pollfd->events |= POLLOUT;
+	}
+
+	Assert(event->fd >= 0);
+}
+#endif
+
+#if defined(WAIT_USE_WIN32)
+static void
+WaitEventAdjustWin32(WaitEventSet *set, WaitEvent *event)
+{
+	HANDLE	   *handle = &set->handles[event->pos + 1];
+
+	if (event->events == WL_LATCH_SET)
+	{
+		Assert(set->latch != NULL);
+		*handle = set->latch->event;
+	}
+	else if (event->events == WL_POSTMASTER_DEATH)
+	{
+		*handle = PostmasterHandle;
+	}
+	else
+	{
+		int			flags = FD_CLOSE;	/* always check for errors/EOF */
+
+		if (event->events & WL_SOCKET_READABLE)
+			flags |= FD_READ;
+		if (event->events & WL_SOCKET_WRITEABLE)
+			flags |= FD_WRITE;
+
+		if (*handle == WSA_INVALID_EVENT)
+		{
+			*handle = WSACreateEvent();
+			if (*handle == WSA_INVALID_EVENT)
+				elog(ERROR, "failed to create event for socket: error code %u",
+					 WSAGetLastError());
+		}
+		if (WSAEventSelect(event->fd, *handle, flags) != 0)
+			elog(ERROR, "failed to set up event for socket: error code %u",
+				 WSAGetLastError());
+
+		Assert(event->fd >= 0);
+	}
+}
+#endif
+
+
+#if defined(WAIT_USE_EPOLL)
+
+/*
+ * Wait using linux' epoll_wait(2).
+ *
+ * This is the preferable wait method, as several readiness notifications are
+ * delivered, without having to iterate through all of set->events. The
+ * returned epoll_event structs contain a pointer to our events, making
+ * association easy.
+ */
+static int
+WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+					  WaitEvent *occurred_events, int nevents)
+{
+	int			returned_events = 0;
+	int			rc;
+	WaitEvent  *cur_event;
+	struct epoll_event *cur_epoll_event;
+
+	/* Sleep */
+	rc = epoll_wait(set->epoll_fd, set->epoll_ret_events,
+					nevents, cur_timeout);
+
+	/* Check return code */
+	if (rc < 0)
+	{
+		/* EINTR is okay, otherwise complain */
+		if (errno != EINTR)
+		{
+			waiting = false;
+			ereport(ERROR,
+					(errcode_for_socket_access(),
+					 errmsg("epoll_wait() failed: %m")));
+		}
+		return 0;
+	}
+	else if (rc == 0)
+	{
+		/* timeout exceeded */
+		return -1;
+	}
+
+	/*
+	 * At least one event occurred, iterate over the returned epoll events
+	 * until they're either all processed, or we've returned all the events
+	 * the caller desired.
+	 */
+	for (cur_epoll_event = set->epoll_ret_events;
+		 cur_epoll_event < (set->epoll_ret_events + rc) &&
+		 returned_events < nevents;
+		 cur_epoll_event++)
+	{
+		/* epoll's data pointer is set to the associated WaitEvent */
+		cur_event = (WaitEvent *) cur_epoll_event->data.ptr;
+
+		occurred_events->pos = cur_event->pos;
+		occurred_events->events = 0;
+
+		if (cur_event->events == WL_LATCH_SET &&
+			cur_epoll_event->events & (EPOLLIN | EPOLLERR | EPOLLHUP))
+		{
+			/* There's data in the self-pipe, clear it. */
+			drainSelfPipe();
+
+			if (set->latch->is_set)
+			{
+				occurred_events->fd = -1;
+				occurred_events->events = WL_LATCH_SET;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+		else if (cur_event->events == WL_POSTMASTER_DEATH &&
+				 cur_epoll_event->events & (EPOLLIN | EPOLLERR | EPOLLHUP))
+		{
+			/*
+			 * We expect an EPOLLHUP when the remote end is closed, but
+			 * because we don't expect the pipe to become readable or to have
+			 * any errors either, treat those cases as postmaster death, too.
+			 *
+			 * According to the select(2) man page on Linux, select(2) may
+			 * spuriously return and report a file descriptor as readable,
+			 * when it's not; and presumably so can epoll_wait(2).  It's not
+			 * clear that the relevant cases would ever apply to the
+			 * postmaster pipe, but since the consequences of falsely
+			 * returning WL_POSTMASTER_DEATH could be pretty unpleasant, we
+			 * take the trouble to positively verify EOF with
+			 * PostmasterIsAlive().
+			 */
+			if (!PostmasterIsAlive())
+			{
+				occurred_events->fd = -1;
+				occurred_events->events = WL_POSTMASTER_DEATH;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+		else if (cur_event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+		{
+			Assert(cur_event->fd >= 0);
+
+			if ((cur_event->events & WL_SOCKET_READABLE) &&
+				(cur_epoll_event->events & (EPOLLIN | EPOLLERR | EPOLLHUP)))
+			{
+				occurred_events->events |= WL_SOCKET_READABLE;
+			}
+
+			if ((cur_event->events & WL_SOCKET_WRITEABLE) &&
+				(cur_epoll_event->events & (EPOLLOUT | EPOLLERR | EPOLLHUP)))
+			{
+				occurred_events->events |= WL_SOCKET_WRITEABLE;
+			}
+
+			if (occurred_events->events != 0)
+			{
+				occurred_events->fd = cur_event->fd;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+	}
+
+	return returned_events;
+}
+
+#elif defined(WAIT_USE_POLL)
+
+/*
+ * Wait using poll(2).
+ *
+ * This allows receiving readiness notifications for several events at once,
+ * but requires iterating through all of set->pollfds.
+ */
+static inline int
+WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+					  WaitEvent *occurred_events, int nevents)
+{
+	int			returned_events = 0;
+	int			rc;
+	WaitEvent  *cur_event;
+	struct pollfd *cur_pollfd;
+
+	/* return immediately if latch is set */
+	if (set->latch && set->latch->is_set)
+	{
+		occurred_events->fd = -1;
+		occurred_events->pos = set->latch_pos;
+		occurred_events->events = WL_LATCH_SET;
+		occurred_events++;
+		returned_events++;
+
+		return returned_events;
+	}
+
+	/* Sleep */
+	rc = poll(set->pollfds, set->nevents, (int) cur_timeout);
+
+	/* Check return code */
+	if (rc < 0)
+	{
+		/* EINTR is okay, otherwise complain */
+		if (errno != EINTR)
+		{
+			waiting = false;
+			ereport(ERROR,
+					(errcode_for_socket_access(),
+					 errmsg("poll() failed: %m")));
+		}
+		return 0;
+	}
+	else if (rc == 0)
+	{
+		/* timeout exceeded */
+		return -1;
+	}
+
+	for (cur_event = set->events, cur_pollfd = set->pollfds;
+		 cur_event < (set->events + set->nevents) &&
+		 returned_events < nevents;
+		 cur_event++, cur_pollfd++)
+	{
+		occurred_events->pos = cur_event->pos;
+		occurred_events->events = 0;
+
+		if (cur_event->events == WL_LATCH_SET &&
+			(cur_pollfd->revents & (POLLIN | POLLHUP | POLLERR | POLLNVAL)))
+		{
+			/* There's data in the self-pipe, clear it. */
+			drainSelfPipe();
+
+			if (set->latch->is_set)
+			{
+				occurred_events->fd = -1;
+				occurred_events->events = WL_LATCH_SET;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+		else if (cur_event->events == WL_POSTMASTER_DEATH &&
+			 (cur_pollfd->revents & (POLLIN | POLLHUP | POLLERR | POLLNVAL)))
+		{
+			/*
+			 * We expect a POLLHUP when the remote end is closed, but because
+			 * we don't expect the pipe to become readable or to have any
+			 * errors either, treat those cases as postmaster death, too.
+			 *
+			 * According to the select(2) man page on Linux, select(2) may
+			 * spuriously return and report a file descriptor as readable,
+			 * when it's not; and presumably so can poll(2).  It's not clear
+			 * that the relevant cases would ever apply to the postmaster
+			 * pipe, but since the consequences of falsely returning
+			 * WL_POSTMASTER_DEATH could be pretty unpleasant, we take the
+			 * trouble to positively verify EOF with PostmasterIsAlive().
+			 */
+			if (!PostmasterIsAlive())
+			{
+				occurred_events->fd = -1;
+				occurred_events->events = WL_POSTMASTER_DEATH;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+		else if (cur_event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+		{
+			Assert(cur_event->fd);
+
+			if ((cur_event->events & WL_SOCKET_READABLE) &&
+			 (cur_pollfd->revents & (POLLIN | POLLHUP | POLLERR | POLLNVAL)))
+			{
+				occurred_events->events |= WL_SOCKET_READABLE;
+			}
+
+			if ((cur_event->events & WL_SOCKET_WRITEABLE) &&
+			(cur_pollfd->revents & (POLLOUT | POLLHUP | POLLERR | POLLNVAL)))
+			{
+				occurred_events->events |= WL_SOCKET_WRITEABLE;
+			}
+
+			if (occurred_events->events != 0)
+			{
+				occurred_events->fd = cur_event->fd;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+	}
+	return returned_events;
+}
+
+#elif defined(WAIT_USE_SELECT)
+
+/*
+ * Wait using select(2).
+ *
+ * On at least older linux kernels select(), in violation of POSIX,
+ * doesn't reliably return a socket as writable if closed - but we rely on
+ * that. So far all the known cases of this problem are on platforms that
+ * also provide a poll() implementation without that bug.  If we find one
+ * where that's not the case, we'll need to add a workaround.
+ */
+static inline int
+WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+					  WaitEvent *occurred_events, int nevents)
+{
+	int			returned_events = 0;
+	int			rc;
+	WaitEvent  *cur_event;
+	fd_set		input_mask;
+	fd_set		output_mask;
+	int			hifd = -1;
+	struct timeval tv;
+	struct timeval *tvp = NULL;
+
+	FD_ZERO(&input_mask);
+	FD_ZERO(&output_mask);
+
+	/*
+	 * Prepare input/output masks. We do so every loop iteration as there's no
+	 * entirely portable way to copy fd_sets.
+	 */
+	for (cur_event = set->events;
+		 cur_event < (set->events + set->nevents);
+		 cur_event++)
+	{
+		if (cur_event->events == WL_LATCH_SET)
+			FD_SET(cur_event->fd, &input_mask);
+		else if (cur_event->events == WL_POSTMASTER_DEATH)
+			FD_SET(cur_event->fd, &input_mask);
+		else
+		{
+			Assert(cur_event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE));
+			if (cur_event->events & WL_SOCKET_READABLE)
+				FD_SET(cur_event->fd, &input_mask);
+			if (cur_event->events & WL_SOCKET_WRITEABLE)
+				FD_SET(cur_event->fd, &output_mask);
+		}
+
+		if (cur_event->fd > hifd)
+			hifd = cur_event->fd;
+	}
+
+	/* Sleep */
+	if (cur_timeout >= 0)
+	{
+		tv.tv_sec = cur_timeout / 1000L;
+		tv.tv_usec = (cur_timeout % 1000L) * 1000L;
+		tvp = &tv;
+	}
+	rc = select(hifd + 1, &input_mask, &output_mask, NULL, tvp);
+
+	/* Check return code */
+	if (rc < 0)
+	{
+		/* EINTR is okay, otherwise complain */
+		if (errno != EINTR)
+		{
+			waiting = false;
+			ereport(ERROR,
+					(errcode_for_socket_access(),
+					 errmsg("select() failed: %m")));
+		}
+		return 0; /* retry */
+	}
+	else if (rc == 0)
+	{
+		/* timeout exceeded */
+		return -1;
+	}
+
+	/*
+	 * To associate events with select's masks, we have to check the status
+	 * of the file descriptor associated with each event, by looping through
+	 * all events.
+	 */
+	for (cur_event = set->events;
+		 cur_event < (set->events + set->nevents)
+		 && returned_events < nevents;
+		 cur_event++)
+	{
+		occurred_events->pos = cur_event->pos;
+		occurred_events->events = 0;
+
+		if (cur_event->events == WL_LATCH_SET &&
+			FD_ISSET(cur_event->fd, &input_mask))
+		{
+			/* There's data in the self-pipe, clear it. */
+			drainSelfPipe();
+
+			if (set->latch->is_set)
+			{
+				occurred_events->fd = -1;
+				occurred_events->events = WL_LATCH_SET;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+		else if (cur_event->events == WL_POSTMASTER_DEATH &&
+				 FD_ISSET(cur_event->fd, &input_mask))
+		{
+			/*
+			 * According to the select(2) man page on Linux, select(2) may
+			 * spuriously return and report a file descriptor as readable,
+			 * when it's not; and presumably so can poll(2).  It's not clear
+			 * that the relevant cases would ever apply to the postmaster
+			 * pipe, but since the consequences of falsely returning
+			 * WL_POSTMASTER_DEATH could be pretty unpleasant, we take the
+			 * trouble to positively verify EOF with PostmasterIsAlive().
+			 */
+			if (!PostmasterIsAlive())
+			{
+				occurred_events->fd = -1;
+				occurred_events->events = WL_POSTMASTER_DEATH;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+		else if (cur_event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+		{
+			Assert(cur_event->fd >= 0);
+
+			if ((cur_event->events & WL_SOCKET_READABLE) &&
+				FD_ISSET(cur_event->fd, &input_mask))
+			{
+				/* data available in socket, or EOF */
+				occurred_events->events |= WL_SOCKET_READABLE;
+			}
+
+			if ((cur_event->events & WL_SOCKET_WRITEABLE) &&
+				FD_ISSET(cur_event->fd, &output_mask))
+			{
+				/* socket is writeable, or EOF */
+				occurred_events->events |= WL_SOCKET_WRITEABLE;
+			}
+
+			if (occurred_events->events != 0)
+			{
+				occurred_events->fd = cur_event->fd;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+	}
+	return returned_events;
+}
+
+#elif defined(WAIT_USE_WIN32)
+
+/*
+ * Wait using Windows' WaitForMultipleObjects().
+ *
+ * Unfortunately this will only ever return a single readiness notification at
+ * a time.
+ */
+static inline int
+WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+					  WaitEvent *occurred_events, int nevents)
+{
+	int			returned_events = 0;
+	DWORD		rc;
+	WaitEvent  *cur_event;
+
+	/*
+	 * Sleep.
+	 *
+	 * Need to wait for ->nevents + 1, because signal handle is in [0].
+	 */
+	rc = WaitForMultipleObjects(set->nevents + 1, set->handles, FALSE,
+								cur_timeout);
+
+	/* Check return code */
+	if (rc == WAIT_FAILED)
+		elog(ERROR, "WaitForMultipleObjects() failed: error code %lu",
+			 GetLastError());
+	else if (rc == WAIT_TIMEOUT)
+	{
+		/* timeout exceeded */
+		return -1;
+	}
+
+	if (rc == WAIT_OBJECT_0)
+	{
+		/* Service newly-arrived signals */
+		pgwin32_dispatch_queued_signals();
+		return 0;				/* retry */
+	}
+
+	/*
+	 * With an offset of one, due to pgwin32_signal_event, the handle offset
+	 * directly corresponds to a wait event.
+	 */
+	cur_event = (WaitEvent *) &set->events[rc - WAIT_OBJECT_0 - 1];
+
+	occurred_events->pos = cur_event->pos;
+	occurred_events->events = 0;
+
+	if (cur_event->events == WL_LATCH_SET)
+	{
+		if (!ResetEvent(set->latch->event))
+			elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());
+
+		if (set->latch->is_set)
+		{
+			occurred_events->fd = -1;
+			occurred_events->events = WL_LATCH_SET;
+			occurred_events++;
+			returned_events++;
+		}
+	}
+	else if (cur_event->events == WL_POSTMASTER_DEATH)
+	{
+		/*
+		 * Postmaster apparently died.  Since the consequences of falsely
+		 * returning WL_POSTMASTER_DEATH could be pretty unpleasant, we take
+		 * the trouble to positively verify this with PostmasterIsAlive(),
+		 * even though there is no known reason to think that the event could
+		 * be falsely set on Windows.
+		 */
+		if (!PostmasterIsAlive())
+		{
+			occurred_events->fd = -1;
+			occurred_events->events = WL_POSTMASTER_DEATH;
+			occurred_events++;
+			returned_events++;
+		}
+	}
+	else if (cur_event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+	{
+		WSANETWORKEVENTS resEvents;
+
+		Assert(cur_event->fd);
+
+		occurred_events->fd = cur_event->fd;
+
+		ZeroMemory(&resEvents, sizeof(resEvents));
+		if (WSAEnumNetworkEvents(cur_event->fd, set->handles[cur_event->pos + 1], &resEvents) != 0)
+			elog(ERROR, "failed to enumerate network events: error code %u",
+				 WSAGetLastError());
+		if ((cur_event->events & WL_SOCKET_READABLE) &&
+			(resEvents.lNetworkEvents & FD_READ))
+		{
+			occurred_events->events |= WL_SOCKET_READABLE;
+		}
+		if ((cur_event->events & WL_SOCKET_WRITEABLE) &&
+			(resEvents.lNetworkEvents & FD_WRITE))
+		{
+			occurred_events->events |= WL_SOCKET_WRITEABLE;
+		}
+		if (resEvents.lNetworkEvents & FD_CLOSE)
+		{
+			if (cur_event->events & WL_SOCKET_READABLE)
+				occurred_events->events |= WL_SOCKET_READABLE;
+			if (cur_event->events & WL_SOCKET_WRITEABLE)
+				occurred_events->events |= WL_SOCKET_WRITEABLE;
+		}
+
+		if (occurred_events->events != 0)
+		{
+			occurred_events++;
+			returned_events++;
+		}
+	}
+
+	return returned_events;
+}
+#endif
+
+/*
  * SetLatch uses SIGUSR1 to wake up the process waiting on the latch.
  *
  * Wake up WaitLatch, if we're waiting.  (We might not be, since SIGUSR1 is
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 18f5e6f..d13355b 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -33,6 +33,7 @@
 
 #include "access/htup_details.h"
 #include "catalog/pg_authid.h"
+#include "libpq/libpq.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "postmaster/autovacuum.h"
@@ -247,6 +248,9 @@ SwitchToSharedLatch(void)
 
 	MyLatch = &MyProc->procLatch;
 
+	if (FeBeWaitSet)
+		ModifyWaitEvent(FeBeWaitSet, 1, WL_LATCH_SET, MyLatch);
+
 	/*
 	 * Set the shared latch as the local one might have been set. This
 	 * shouldn't normally be necessary as code is supposed to check the
@@ -262,6 +266,10 @@ SwitchBackToLocalLatch(void)
 	Assert(MyProc != NULL && MyLatch == &MyProc->procLatch);
 
 	MyLatch = &LocalLatchData;
+
+	if (FeBeWaitSet)
+		ModifyWaitEvent(FeBeWaitSet, 1, WL_LATCH_SET, MyLatch);
+
 	SetLatch(MyLatch);
 }
 
diff --git a/src/include/libpq/libpq.h b/src/include/libpq/libpq.h
index 0569994..109fdf7 100644
--- a/src/include/libpq/libpq.h
+++ b/src/include/libpq/libpq.h
@@ -19,6 +19,7 @@
 
 #include "lib/stringinfo.h"
 #include "libpq/libpq-be.h"
+#include "storage/latch.h"
 
 
 typedef struct
@@ -95,6 +96,8 @@ extern ssize_t secure_raw_write(Port *port, const void *ptr, size_t len);
 
 extern bool ssl_loaded_verify_locations;
 
+extern WaitEventSet *FeBeWaitSet;
+
 /* GUCs */
 extern char *SSLCipherSuites;
 extern char *SSLECDHCurve;
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 3813226..c72635c 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -530,6 +530,9 @@
 /* Define to 1 if you have the syslog interface. */
 #undef HAVE_SYSLOG
 
+/* Define to 1 if you have the <sys/epoll.h> header file. */
+#undef HAVE_SYS_EPOLL_H
+
 /* Define to 1 if you have the <sys/ioctl.h> header file. */
 #undef HAVE_SYS_IOCTL_H
 
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 2719498..fa66ec3 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -102,9 +102,23 @@ typedef struct Latch
 #define WL_TIMEOUT			 (1 << 3)
 #define WL_POSTMASTER_DEATH  (1 << 4)
 
+typedef struct WaitEventSet WaitEventSet;
+
+typedef struct WaitEvent
+{
+	int		pos;		/* position in the event data structure */
+	uint32	events;		/* tripped events */
+	int		fd;			/* fd associated with event */
+} WaitEvent;
+
 /*
  * prototypes for functions in latch.c
  */
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern void FreeWaitEventSet(WaitEventSet *set);
+extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, int fd, Latch *latch);
+extern void ModifyWaitEvent(WaitEventSet *set, int pos, uint32 events, Latch *latch);
+extern int WaitEventSetWait(WaitEventSet *set, long timeout, WaitEvent* occurred_events, int nevents);
 extern void InitializeLatchSupport(void);
 extern void InitLatch(volatile Latch *latch);
 extern void InitSharedLatch(volatile Latch *latch);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b850db0..c2511de 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2113,6 +2113,8 @@ WalSnd
 WalSndCtlData
 WalSndSendDataCallback
 WalSndState
+WaitEvent
+WaitEventSet
 WholeRowVarExprState
 WindowAgg
 WindowAggState
-- 
2.7.0.229.g701fa7f
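
For orientation, here is a minimal usage sketch of the WaitEventSet API the patch introduces. It is illustrative only; CurrentMemoryContext, MyLatch and the sock variable are assumptions, not taken from the patch.

	WaitEventSet *set;
	WaitEvent	event;
	int			nevents;

	/* room for latch, postmaster death and one socket */
	set = CreateWaitEventSet(CurrentMemoryContext, 3);
	AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch);
	AddWaitEventToSet(set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL);
	AddWaitEventToSet(set, WL_SOCKET_READABLE, sock, NULL);

	/* wait up to 10 seconds; 0 returned events means the timeout expired */
	nevents = WaitEventSetWait(set, 10000L, &event, 1);
	if (nevents > 0 && (event.events & WL_SOCKET_READABLE))
	{
		/* data (or EOF) is available on sock */
	}

	FreeWaitEventSet(set);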

#59Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#56)
Re: Performance degradation in commit ac1d794

On Thu, Mar 17, 2016 at 10:57 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-17 09:01:36 -0400, Robert Haas wrote:

0001: Looking at this again, I'm no longer sure this is a bug.
Doesn't your patch just check the same conditions in the opposite
order?

Yes, that's what's required

I mean, they are just variables. You can check them in either order
and get the same results, no?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#60Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#55)
Re: Performance degradation in commit ac1d794

On Thu, Mar 17, 2016 at 10:53 AM, Andres Freund <andres@anarazel.de> wrote:

I wonder if there's a way to refactor this code to avoid having so
much cut-and-paste duplication.

I guess you mean WaitEventSetWait() and WaitEventAdjust*? I've tried,
and my attempt ended up looking nearly unreadable, because of the number of
ifdefs. I've not found a good approach. Which is sad, because adding back
select support is going to increase the duplication further :( - but
it's also further away from poll etc. (different type of timestamp,
entirely different way of returning events).

I was more thinking of stuff like this:

+            /*
+             * We expect an EPOLLHUP when the remote end is closed, but
+             * because we don't expect the pipe to become readable or to have
+             * any errors either, treat those cases as postmaster death, too.
+             *
+             * According to the select(2) man page on Linux, select(2) may
+             * spuriously return and report a file descriptor as readable,
+             * when it's not; and presumably so can epoll_wait(2).  It's not
+             * clear that the relevant cases would ever apply to the
+             * postmaster pipe, but since the consequences of falsely
+             * returning WL_POSTMASTER_DEATH could be pretty unpleasant, we
+             * take the trouble to positively verify EOF with
+             * PostmasterIsAlive().
+             */

0 at the top of the loop and skip it forthwith if so.

You mean in WaitEventSetWait()? There's
else if (rc == 0)
{
break;
}
which is the timeout case. There should never be any other case of
returning 0 elements?

No, I meant if (cur_event->events == 0) continue;

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#61Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#57)
Re: Performance degradation in commit ac1d794

On Thu, Mar 17, 2016 at 11:17 AM, Andres Freund <andres@anarazel.de> wrote:

Right now, do use a WaitEventSet you'd do something like
WaitEvent event;

ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);

WaitEventSetWait(FeBeWaitSet, 0 /* no timeout */, &event, 1);

i.e. use a WaitEvent on the stack to receive the changes. If you wanted
to get more changes than just one, you could end up allocating a fair
bit of stack space.

We could instead allocate the returned events as part of the event set,
and return them. Either by returning a NULL terminated array, or by
continuing to return the number of events as now, and additionally
return the event data structure via a pointer.

So the above would be

WaitEvent *events;
int nevents;

ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);

nevents = WaitEventSetWait(FeBeWaitSet, 0 /* no timeout */, events, 10);

for (int off = 0; off < nevents; off++)
; // stuff

I don't see this as being particularly better.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#62Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#58)
Re: Performance degradation in commit ac1d794

On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2016-03-17 09:01:36 -0400, Robert Haas wrote:

0001: Looking at this again, I'm no longer sure this is a bug.
Doesn't your patch just check the same conditions in the opposite
order?

Which is important, because what's in what pfds[x] depends on
wakeEvents. Folded it into a later patch; it's not harmful as long as
we're only ever testing pfds[0].
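
As an illustration of the pfds[x] point (based on the pre-patch poll() setup in WaitLatchOrSocket() quoted in the diff above; illustrative only):

	/*
	 * pfds[] layout in the old poll path depends on wakeEvents:
	 *
	 *   WL_LATCH_SET | WL_SOCKET_READABLE | WL_POSTMASTER_DEATH:
	 *     pfds[0] = self-pipe, pfds[1] = socket, pfds[2] = postmaster pipe
	 *
	 *   WL_LATCH_SET | WL_POSTMASTER_DEATH (no socket requested):
	 *     pfds[0] = self-pipe, pfds[1] = postmaster pipe
	 *
	 * Only pfds[0] has a fixed meaning, so which wakeEvents were requested
	 * must be checked before interpreting a given slot's revents.
	 */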

0003: Mostly boring. But the change to win32_latch.c seems to remove
an unrelated check.

Argh.

+ * from inside a signal handler in latch_sigusr1_handler().
 *
 * Note: we assume that the kernel calls involved in drainSelfPipe()
 * and SetLatch() will provide adequate synchronization on machines
 * with weak memory ordering, so that we cannot miss seeing is_set if
 * the signal byte is already in the pipe when we drain it.
 */
- drainSelfPipe();
-

Above part of comment looks redundant after this patch. I have done some
tests on Windows with 0003 patch which includes running the regressions
(vcregress check) and it passes. Will look into it tomorrow once again and
share if I find anything wrong with it, but feel free to proceed if you want.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#63Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#60)
Re: Performance degradation in commit ac1d794

On 2016-03-18 05:56:41 -0400, Robert Haas wrote:

0 at the top of the loop and skip it forthwith if so.

You mean in WaitEventSetWait()? There's
else if (rc == 0)
{
break;
}
which is the timeout case. There should never be any other case of
returning 0 elements?

No, I meant if (cur_event->events == 0) continue;

I'm not following. Why would there be an event with an empty event
mask? Ok, you can disable all notifications for a socket using
ModifyWaitEvent(), but that's not particularly common, right? At least
for epoll, it'd not play a role anyway, since epoll_wait() will actually
return pointers to the elements we're waiting on; for windows we get the
offset in ->handles. I guess we could do so in the select/poll case,
but adding another if for something infrequent doesn't strike me as a
great benefit.

- Andres


#64Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#63)
Re: Performance degradation in commit ac1d794

On Fri, Mar 18, 2016 at 1:53 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-18 05:56:41 -0400, Robert Haas wrote:

0 at the top of the loop and skip it forthwith if so.

You mean in WaitEventSetWait()? There's
else if (rc == 0)
{
break;
}
which is the timeout case. There should never be any other case of
returning 0 elements?

No, I meant if (cur_event->events == 0) continue;

I'm not following. Why would there be an event without an empty event
mask? Ok, you can disable all notifications for a socket using
ModifyWaitEvent(), but that's not particularly common, right? At least
for epoll, it'd not play a role anyway, since epoll_wait() will actually
return pointers to the elements we're waiting on; for windows we get the
offset in ->handles. I guess we could do so in the select/poll case,
but adding another if for something infrequent doesn't strike me as a
great benefit.

No, I mean it should be quite common for a particular fd to have no
events reported. If we're polling on 100 fds and 1 of them is active
and the other 99 are just sitting there, we want to skip over the
other 99 as quickly as possible.
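
A sketch of the kind of fastpath being discussed, written against the poll-based WaitEventSetWaitBlock() loop from the patch (an assumption about how it could look, not something the posted patch contains):

	for (cur_event = set->events, cur_pollfd = set->pollfds;
		 cur_event < (set->events + set->nevents) &&
		 returned_events < nevents;
		 cur_event++, cur_pollfd++)
	{
		/* poll() reported nothing for this entry; skip it cheaply */
		if (cur_pollfd->revents == 0)
			continue;

		/* ... existing per-event-type handling continues unchanged ... */
	}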

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#65Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#64)
Re: Performance degradation in commit ac1d794

On 2016-03-18 14:00:58 -0400, Robert Haas wrote:

No, I mean it should be quite common for a particular fd to have no
events reported. If we're polling on 100 fds and 1 of them is active
and the other 99 are just sitting there, we want to skip over the
other 99 as quickly as possible.

cur_event points to the event that's registered with the set, not the
one that's "returned" or "modified" by poll/select; that's where the
confusion is originating from. I'll add a fastpath.

Greetings,

Andres Freund


#66Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#62)
Re: Performance degradation in commit ac1d794

On 2016-03-18 20:14:07 +0530, Amit Kapila wrote:

+ * from inside a signal handler in latch_sigusr1_handler().
 *
 * Note: we assume that the kernel calls involved in drainSelfPipe()
 * and SetLatch() will provide adequate synchronization on machines
 * with weak memory ordering, so that we cannot miss seeing is_set if
 * the signal byte is already in the pipe when we drain it.
 */
- drainSelfPipe();
-

Above part of comment looks redundant after this patch.

Don't think so. Moving it closer to the drainSelfPipe() call might be
neat, but since there's several callsites...

I have done some
tests on Windows with 0003 patch which includes running the regressions
(vcregress check) and it passes. Will look into it tomorrow once again and
share if I find anything wrong with it, but feel free to proceed if you want.

Thanks for the testing thus far! Let's see what the buildfarm has to
say.

Andres


#67Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#58)
Re: Performance degradation in commit ac1d794

Andres Freund wrote:

From 1d444b0855dbf65d66d73beb647b772fff3404c8 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 18 Mar 2016 00:52:07 -0700
Subject: [PATCH 4/5] Combine win32 and unix latch implementations.

Previously latches for windows and unix had been implemented in
different files. The next patch in this series will introduce an
expanded wait infrastructure, keeping the implementation separate would
introduce too much duplication.

This basically just moves the functions, without too much change. The
reason to keep this separate is that it allows blame to continue working
a little less badly; and to make review a tiny bit easier.

This seems a reasonable change, but I think that the use of WIN32 vs.
LATCH_USE_WIN32 is pretty confusing. In particular, LATCH_USE_WIN32
isn't actually used for anything ... I suppose we don't care since this
is a temporary state of affairs only?

In 0005: In latch.c you typedef WaitEventSet, but the typedef already
appears in latch.h. You need only declare the struct in latch.c,
without typedef'ing.

Haven't really reviewed anything here yet, just skimming ATM. Having so
many #ifdefs all over the place in this file looks really bad, but I
guess there's no way around that because this is very platform-specific.
I hope pgindent doesn't choke on it.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#68Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#67)
Re: Performance degradation in commit ac1d794

On 2016-03-18 15:49:51 -0300, Alvaro Herrera wrote:

This seems a reasonable change, but I think that the use of WIN32 vs.
LATCH_USE_WIN32 is pretty confusing. In particular, LATCH_USE_WIN32
isn't actually used for anything ... I suppose we don't care since this
is a temporary state of affairs only?

Hm, I guess we could make some more of those use LATCH_USE_WIN32.
There's essentially two axes here: a) latch notification method
(self-pipe vs windows events), and b) readiness notification (epoll vs poll
vs select vs WaitForMultipleObjects).

In 0005: In latch.c you typedef WaitEventSet, but the typedef already
appears in latch.h. You need only declare the struct in latch.c,
without typedef'ing.

Good catch. It's even important, some compilers choke on that.

Haven't really reviewed anything here yet, just skimming ATM. Having so
many #ifdefs all over the place in this file looks really bad, but I
guess there's no way around that because this is very platform-specific.

I think it's hard to further reduce the number of ifdefs, but if you
have ideas...

I hope pgindent doesn't choke on it.

The patch is pgindented (I'd personally never decrease indentation just
to fit a line into 79 chars, as pgindent does...).

Thanks for looking!

Greetings,

Andres Freund


#69Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#66)
Re: Performance degradation in commit ac1d794

On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-18 20:14:07 +0530, Amit Kapila wrote:

I have done some
tests on Windows with 0003 patch which includes running the regressions
(vcregress check) and it passes. Will look into it tomorrow once again and
share if I find anything wrong with it, but feel free to proceed if you want.

Thanks for the testing thus far! Let's see what the buildfarm has to
say.

Won't the new code need to ensure that ResetEvent(latchevent) should get
called in case WaitForMultipleObjects() comes out when both
pgwin32_signal_event and latchevent are signalled at the same time?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#70Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#69)
Re: Performance degradation in commit ac1d794

On March 18, 2016 11:32:53 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund <andres@anarazel.de>
wrote:

On 2016-03-18 20:14:07 +0530, Amit Kapila wrote:

I have done some
tests on Windows with 0003 patch which includes running the regressions
(vcregress check) and it passes. Will look into it tomorrow once again and
share if I find anything wrong with it, but feel free to proceed if you want.

Thanks for the testing thus far! Let's see what the buildfarm has to
say.

Won't the new code need to ensure that ResetEvent(latchevent) should get
called in case WaitForMultipleObjects() comes out when both
pgwin32_signal_event and latchevent are signalled at the same time?

WaitForMultiple only reports the readiness of one event at a time, no?

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


#71Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#70)
Re: Performance degradation in commit ac1d794

On Sat, Mar 19, 2016 at 12:14 PM, Andres Freund <andres@anarazel.de> wrote:

On March 18, 2016 11:32:53 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund <andres@anarazel.de>
wrote:

On 2016-03-18 20:14:07 +0530, Amit Kapila wrote:

I have done some
tests on Windows with 0003 patch which includes running the

regressions

(vcregress check) and it passes. Will look into it tomorrow once

again
and

share if I find anything wrong with it, but feel to proceed if you

want.

Thanks for the testing thus far! Let's see what the buildfarm has to
say.

Won't the new code needs to ensure that ResetEvent(latchevent) should
get
called in case WaitForMultipleObjects() comes out when both
pgwin32_signal_event and latchevent are signalled at the same time?

WaitForMultiple only reports the readiness of on event at a time, no?

I don't think so, please read link [1] with a focus on the paragraph
below, which states how it reports the readiness or signaled state when
multiple objects become signaled.

"When *bWaitAll* is *FALSE*, this function checks the handles in the array
in order starting with index 0, until one of the objects is signaled. If
multiple objects become signaled, the function returns the index of the
first handle in the array whose object was signaled."

[1]: https://msdn.microsoft.com/en-us/library/windows/desktop/ms687025(v=vs.85).aspx

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#72Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#71)
Re: Performance degradation in commit ac1d794

On March 18, 2016 11:52:08 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Mar 19, 2016 at 12:14 PM, Andres Freund <andres@anarazel.de>
wrote:

On March 18, 2016 11:32:53 PM PDT, Amit Kapila

<amit.kapila16@gmail.com>
wrote:

On Sat, Mar 19, 2016 at 12:00 AM, Andres Freund <andres@anarazel.de>
wrote:

On 2016-03-18 20:14:07 +0530, Amit Kapila wrote:

I have done some
tests on Windows with 0003 patch which includes running the

regressions

(vcregress check) and it passes. Will look into it tomorrow

once

again
and

share if I find anything wrong with it, but feel to proceed if

you

want.

Thanks for the testing thus far! Let's see what the buildfarm has

to

say.

Won't the new code needs to ensure that ResetEvent(latchevent)

should

get
called in case WaitForMultipleObjects() comes out when both
pgwin32_signal_event and latchevent are signalled at the same time?

WaitForMultiple only reports the readiness of on event at a time, no?

I don't think so, please read link [1] with a focus on below paragraph
which states how it reports the readiness or signaled state when
multiple
objects become signaled.

"When *bWaitAll* is *FALSE*, this function checks the handles in the
array
in order starting with index 0, until one of the objects is signaled.
If
multiple objects become signaled, the function returns the index of the
first handle in the array whose object was signaled."

I think that's OK. We'll just get the next event the next time we call waitfor*. It's also not different from the way the routine currently handles normal socket and postmaster events, no? It'd be absurdly broken if it handled edge-triggered events like FD_CLOSE in a way you can't discover.
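(To make that concrete, here is a small standalone sketch, not taken
from any patch, using two manual-reset events like the latch event:
when both are signalled, each WaitForMultipleObjects() call reports
only the lowest signalled index, and the remaining one is simply
returned by the next call.)

#include <windows.h>

int
main(void)
{
	HANDLE		ev[2];
	DWORD		rc;

	/* two manual-reset events, initially not signalled */
	ev[0] = CreateEvent(NULL, TRUE, FALSE, NULL);
	ev[1] = CreateEvent(NULL, TRUE, FALSE, NULL);

	SetEvent(ev[0]);
	SetEvent(ev[1]);

	/* both signalled: returns WAIT_OBJECT_0 (index 0 only) */
	rc = WaitForMultipleObjects(2, ev, FALSE, 0);
	ResetEvent(ev[rc - WAIT_OBJECT_0]);

	/* ev[1] is still signalled: this call returns WAIT_OBJECT_0 + 1 */
	rc = WaitForMultipleObjects(2, ev, FALSE, 0);
	ResetEvent(ev[rc - WAIT_OBJECT_0]);

	return 0;
}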

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#73Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#72)
Re: Performance degradation in commit ac1d794

On Sat, Mar 19, 2016 at 12:40 PM, Andres Freund <andres@anarazel.de> wrote:

On March 18, 2016 11:52:08 PM PDT, Amit Kapila <amit.kapila16@gmail.com>

wrote:

Won't the new code needs to ensure that ResetEvent(latchevent)

should

get
called in case WaitForMultipleObjects() comes out when both
pgwin32_signal_event and latchevent are signalled at the same time?

WaitForMultiple only reports the readiness of on event at a time, no?

I don't think so, please read link [1] with a focus on below paragraph
which states how it reports the readiness or signaled state when
multiple
objects become signaled.

"When *bWaitAll* is *FALSE*, this function checks the handles in the
array
in order starting with index 0, until one of the objects is signaled.
If
multiple objects become signaled, the function returns the index of the
first handle in the array whose object was signaled."

I think that's OK. We'll just get the next event the next time we call

waitfor*. It's also not different to the way the routine is currently
handling normal socket and postmaster events, no?

I think the primary difference between the socket and postmaster events as
compared to the latch event is that those won't start waiting with the wait
event in a signalled state. For the socket event, WaitLatchOrSocket closes
the event at the end and creates it again before entering the wait loop. I
could not see any major problem apart from maybe spurious wake-ups in a few
cases (as we haven't reset the event to the non-signalled state for the
latch event before entering the wait, it can just return immediately) even
if we don't Reset the latch event.
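(For reference, the loop in the combined latch.c posted later in this
thread handles exactly this case: it resets the event first and only then
re-checks is_set, so a stale signalled event cannot cause a spurious
immediate return, while a SetLatch() racing in between re-signals the
event and still wakes the wait. Condensed:)

	do
	{
		/* reset first, then re-check; a concurrent SetLatch() re-signals */
		if (!ResetEvent(latchevent))
			elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());

		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
		{
			result |= WL_LATCH_SET;
			break;
		}

		rc = WaitForMultipleObjects(numevents, events, FALSE, cur_timeout);
		/* ... handle rc as before ... */
	} while (result == 0);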

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#74Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#58)
Re: Performance degradation in commit ac1d794

On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:

Attached is a significantly revised version of the earlier series. Most
importantly I have:
* Unified the window/unix latch implementation into one file (0004)

After applying patch 0004* on HEAD, using the command patch -p1 <
<path_of_patch>, I am getting a build failure:

c1 : fatal error C1083: Cannot open source file:
'src/backend/storage/ipc/latch.c': No such file or directory

I think it could not rename port/unix_latch.c => storage/ipc/latch.c. I
have also tried with git apply, but with no success. Am I doing something wrong?

One minor suggestion about patch:

+#ifndef WIN32
void
latch_sigusr1_handler(void)
{
if (waiting)
sendSelfPipeByte();
}
+#endif /* !WIN32 */

/* Send one byte to the self-pipe, to wake up WaitLatch */
+#ifndef WIN32
static void
sendSelfPipeByte(void)

Instead of wrapping these functions individually in #ifndef WIN32, wouldn't
it be better to combine them all under one block, since they are at the end
of the file?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#75Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#73)
Re: Performance degradation in commit ac1d794

On 2016-03-19 15:43:27 +0530, Amit Kapila wrote:

On Sat, Mar 19, 2016 at 12:40 PM, Andres Freund <andres@anarazel.de> wrote:

On March 18, 2016 11:52:08 PM PDT, Amit Kapila <amit.kapila16@gmail.com>

wrote:

Won't the new code needs to ensure that ResetEvent(latchevent)

should

get
called in case WaitForMultipleObjects() comes out when both
pgwin32_signal_event and latchevent are signalled at the same time?

WaitForMultiple only reports the readiness of on event at a time, no?

I don't think so, please read link [1] with a focus on below paragraph
which states how it reports the readiness or signaled state when
multiple
objects become signaled.

"When *bWaitAll* is *FALSE*, this function checks the handles in the
array
in order starting with index 0, until one of the objects is signaled.
If
multiple objects become signaled, the function returns the index of the
first handle in the array whose object was signaled."

I think this is just incredibly bad documentation. See
https://blogs.msdn.microsoft.com/oldnewthing/20150409-00/?p=44273
(Raymond Chen can be considered an authority here imo).

Andres

#76Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#74)
Re: Performance degradation in commit ac1d794

On 2016-03-19 16:44:49 +0530, Amit Kapila wrote:

On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:

Attached is a significantly revised version of the earlier series. Most
importantly I have:
* Unified the window/unix latch implementation into one file (0004)

After applying patch 0004* on HEAD, using command patch -p1 <
<path_of_patch>, I am getting build failure:

c1 : fatal error C1083: Cannot open source file:
'src/backend/storage/ipc/latch.c': No such file or directory

I think it could not rename port/unix_latch.c => storage/ipc/latch.c. I
have tried with git apply, but no success. Am I doing something wrong?

You skipped applying 0003.

I'll send an updated version - with all the docs and such - in the next
few hours.

One minor suggestion about patch:

+#ifndef WIN32
void
latch_sigusr1_handler(void)
{
if (waiting)
sendSelfPipeByte();
}
+#endif /* !WIN32 */

/* Send one byte to the self-pipe, to wake up WaitLatch */
+#ifndef WIN32
static void
sendSelfPipeByte(void)

Instead of individually defining these functions under #ifndef WIN32, isn't
it better to combine them all as they are at end of file.

They're all at the end of the file. I just found it easier to reason
about when both the #if and #endif are visible in one screenful of
code. I don't feel super strongly about it, though.

Greetings,

Andres Freund

#77Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#76)
2 attachment(s)
Re: Performance degradation in commit ac1d794

On 2016-03-19 18:45:36 -0700, Andres Freund wrote:

On 2016-03-19 16:44:49 +0530, Amit Kapila wrote:

On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:

Attached is a significantly revised version of the earlier series. Most
importantly I have:
* Unified the window/unix latch implementation into one file (0004)

After applying patch 0004* on HEAD, using command patch -p1 <
<path_of_patch>, I am getting build failure:

c1 : fatal error C1083: Cannot open source file:
'src/backend/storage/ipc/latch.c': No such file or directory

I think it could not rename port/unix_latch.c => storage/ipc/latch.c. I
have tried with git apply, but no success. Am I doing something wrong?

You skipped applying 0003.

I'll send an updated version - with all the docs and such - in the next
hours.

Here we go. I think this is getting pretty close to being committable,
minus a bit of testing edge cases on unix (postmaster death,
disconnecting clients in various ways (especially with COPY)) and
windows (uh, does it even work at all?).

There are no large code changes in this revision, mainly some code
polishing and a good bit more comment improvement.

Regards,

Andres

Attachments:

0001-Combine-win32-and-unix-latch-implementations.patch (text/x-patch; charset=us-ascii)
From 9195a0136c44eca4d798d1de478a477ab0ac4724 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 18 Mar 2016 00:52:07 -0700
Subject: [PATCH 1/2] Combine win32 and unix latch implementations.

Previously latches for windows and unix had been implemented in
different files. The next patch in this series will introduce an
expanded wait infrastructure, keeping the implementation separate would
introduce too much duplication.

This basically just moves the functions, without too much change. The
reason to keep this separate is that it allows blame to continue working
a little less badly; and to make review a tiny bit easier.
---
 configure                                          |  10 +-
 configure.in                                       |   8 -
 src/backend/Makefile                               |   3 +-
 src/backend/port/.gitignore                        |   1 -
 src/backend/port/Makefile                          |   2 +-
 src/backend/port/win32_latch.c                     | 349 ---------------------
 src/backend/storage/ipc/Makefile                   |   5 +-
 .../{port/unix_latch.c => storage/ipc/latch.c}     | 280 ++++++++++++++++-
 src/include/storage/latch.h                        |   2 +-
 src/tools/msvc/Mkvcbuild.pm                        |   2 -
 10 files changed, 277 insertions(+), 385 deletions(-)
 delete mode 100644 src/backend/port/win32_latch.c
 rename src/backend/{port/unix_latch.c => storage/ipc/latch.c} (74%)

diff --git a/configure b/configure
index a45be67..c10d954 100755
--- a/configure
+++ b/configure
@@ -14786,13 +14786,6 @@ $as_echo "#define USE_WIN32_SHARED_MEMORY 1" >>confdefs.h
   SHMEM_IMPLEMENTATION="src/backend/port/win32_shmem.c"
 fi
 
-# Select latch implementation type.
-if test "$PORTNAME" != "win32"; then
-  LATCH_IMPLEMENTATION="src/backend/port/unix_latch.c"
-else
-  LATCH_IMPLEMENTATION="src/backend/port/win32_latch.c"
-fi
-
 # If not set in template file, set bytes to use libc memset()
 if test x"$MEMSET_LOOP_LIMIT" = x"" ; then
   MEMSET_LOOP_LIMIT=1024
@@ -15868,7 +15861,7 @@ fi
 ac_config_files="$ac_config_files GNUmakefile src/Makefile.global"
 
 
-ac_config_links="$ac_config_links src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION} src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION} src/backend/port/pg_latch.c:${LATCH_IMPLEMENTATION} src/include/dynloader.h:src/backend/port/dynloader/${template}.h src/include/pg_config_os.h:src/include/port/${template}.h src/Makefile.port:src/makefiles/Makefile.${template}"
+ac_config_links="$ac_config_links src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION} src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION} src/include/dynloader.h:src/backend/port/dynloader/${template}.h src/include/pg_config_os.h:src/include/port/${template}.h src/Makefile.port:src/makefiles/Makefile.${template}"
 
 
 if test "$PORTNAME" = "win32"; then
@@ -16592,7 +16585,6 @@ do
     "src/backend/port/dynloader.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c" ;;
     "src/backend/port/pg_sema.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION}" ;;
     "src/backend/port/pg_shmem.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION}" ;;
-    "src/backend/port/pg_latch.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/pg_latch.c:${LATCH_IMPLEMENTATION}" ;;
     "src/include/dynloader.h") CONFIG_LINKS="$CONFIG_LINKS src/include/dynloader.h:src/backend/port/dynloader/${template}.h" ;;
     "src/include/pg_config_os.h") CONFIG_LINKS="$CONFIG_LINKS src/include/pg_config_os.h:src/include/port/${template}.h" ;;
     "src/Makefile.port") CONFIG_LINKS="$CONFIG_LINKS src/Makefile.port:src/makefiles/Makefile.${template}" ;;
diff --git a/configure.in b/configure.in
index c298926..47d0f58 100644
--- a/configure.in
+++ b/configure.in
@@ -1976,13 +1976,6 @@ else
   SHMEM_IMPLEMENTATION="src/backend/port/win32_shmem.c"
 fi
 
-# Select latch implementation type.
-if test "$PORTNAME" != "win32"; then
-  LATCH_IMPLEMENTATION="src/backend/port/unix_latch.c"
-else
-  LATCH_IMPLEMENTATION="src/backend/port/win32_latch.c"
-fi
-
 # If not set in template file, set bytes to use libc memset()
 if test x"$MEMSET_LOOP_LIMIT" = x"" ; then
   MEMSET_LOOP_LIMIT=1024
@@ -2178,7 +2171,6 @@ AC_CONFIG_LINKS([
   src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c
   src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION}
   src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION}
-  src/backend/port/pg_latch.c:${LATCH_IMPLEMENTATION}
   src/include/dynloader.h:src/backend/port/dynloader/${template}.h
   src/include/pg_config_os.h:src/include/port/${template}.h
   src/Makefile.port:src/makefiles/Makefile.${template}
diff --git a/src/backend/Makefile b/src/backend/Makefile
index b3d5e2e..d22dbbf 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -306,8 +306,7 @@ ifeq ($(PORTNAME), win32)
 endif
 
 distclean: clean
-	rm -f port/tas.s port/dynloader.c port/pg_sema.c port/pg_shmem.c \
-	      port/pg_latch.c
+	rm -f port/tas.s port/dynloader.c port/pg_sema.c port/pg_shmem.c
 
 maintainer-clean: distclean
 	rm -f bootstrap/bootparse.c \
diff --git a/src/backend/port/.gitignore b/src/backend/port/.gitignore
index 7d3ac4a..9f4f1af 100644
--- a/src/backend/port/.gitignore
+++ b/src/backend/port/.gitignore
@@ -1,5 +1,4 @@
 /dynloader.c
-/pg_latch.c
 /pg_sema.c
 /pg_shmem.c
 /tas.s
diff --git a/src/backend/port/Makefile b/src/backend/port/Makefile
index c6b1d20..89549d0 100644
--- a/src/backend/port/Makefile
+++ b/src/backend/port/Makefile
@@ -21,7 +21,7 @@ subdir = src/backend/port
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = atomics.o dynloader.o pg_sema.o pg_shmem.o pg_latch.o $(TAS)
+OBJS = atomics.o dynloader.o pg_sema.o pg_shmem.o $(TAS)
 
 ifeq ($(PORTNAME), darwin)
 SUBDIRS += darwin
diff --git a/src/backend/port/win32_latch.c b/src/backend/port/win32_latch.c
deleted file mode 100644
index bbf1b24..0000000
--- a/src/backend/port/win32_latch.c
+++ /dev/null
@@ -1,349 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * win32_latch.c
- *	  Routines for inter-process latches
- *
- * See unix_latch.c for header comments for the exported functions;
- * the API presented here is supposed to be the same as there.
- *
- * The Windows implementation uses Windows events that are inherited by
- * all postmaster child processes.
- *
- * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
- * Portions Copyright (c) 1994, Regents of the University of California
- *
- * IDENTIFICATION
- *	  src/backend/port/win32_latch.c
- *
- *-------------------------------------------------------------------------
- */
-#include "postgres.h"
-
-#include <fcntl.h>
-#include <limits.h>
-#include <signal.h>
-#include <unistd.h>
-
-#include "miscadmin.h"
-#include "portability/instr_time.h"
-#include "postmaster/postmaster.h"
-#include "storage/barrier.h"
-#include "storage/latch.h"
-#include "storage/pmsignal.h"
-#include "storage/shmem.h"
-
-
-void
-InitializeLatchSupport(void)
-{
-	/* currently, nothing to do here for Windows */
-}
-
-void
-InitLatch(volatile Latch *latch)
-{
-	latch->is_set = false;
-	latch->owner_pid = MyProcPid;
-	latch->is_shared = false;
-
-	latch->event = CreateEvent(NULL, TRUE, FALSE, NULL);
-	if (latch->event == NULL)
-		elog(ERROR, "CreateEvent failed: error code %lu", GetLastError());
-}
-
-void
-InitSharedLatch(volatile Latch *latch)
-{
-	SECURITY_ATTRIBUTES sa;
-
-	latch->is_set = false;
-	latch->owner_pid = 0;
-	latch->is_shared = true;
-
-	/*
-	 * Set up security attributes to specify that the events are inherited.
-	 */
-	ZeroMemory(&sa, sizeof(sa));
-	sa.nLength = sizeof(sa);
-	sa.bInheritHandle = TRUE;
-
-	latch->event = CreateEvent(&sa, TRUE, FALSE, NULL);
-	if (latch->event == NULL)
-		elog(ERROR, "CreateEvent failed: error code %lu", GetLastError());
-}
-
-void
-OwnLatch(volatile Latch *latch)
-{
-	/* Sanity checks */
-	Assert(latch->is_shared);
-	if (latch->owner_pid != 0)
-		elog(ERROR, "latch already owned");
-
-	latch->owner_pid = MyProcPid;
-}
-
-void
-DisownLatch(volatile Latch *latch)
-{
-	Assert(latch->is_shared);
-	Assert(latch->owner_pid == MyProcPid);
-
-	latch->owner_pid = 0;
-}
-
-int
-WaitLatch(volatile Latch *latch, int wakeEvents, long timeout)
-{
-	return WaitLatchOrSocket(latch, wakeEvents, PGINVALID_SOCKET, timeout);
-}
-
-int
-WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
-				  long timeout)
-{
-	DWORD		rc;
-	instr_time	start_time,
-				cur_time;
-	long		cur_timeout;
-	HANDLE		events[4];
-	HANDLE		latchevent;
-	HANDLE		sockevent = WSA_INVALID_EVENT;
-	int			numevents;
-	int			result = 0;
-	int			pmdeath_eventno = 0;
-
-	Assert(wakeEvents != 0);	/* must have at least one wake event */
-
-	/* waiting for socket readiness without a socket indicates a bug */
-	if (sock == PGINVALID_SOCKET &&
-		(wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
-		elog(ERROR, "cannot wait on socket event without a socket");
-
-	if ((wakeEvents & WL_LATCH_SET) && latch->owner_pid != MyProcPid)
-		elog(ERROR, "cannot wait on a latch owned by another process");
-
-	/*
-	 * Initialize timeout if requested.  We must record the current time so
-	 * that we can determine the remaining timeout if WaitForMultipleObjects
-	 * is interrupted.
-	 */
-	if (wakeEvents & WL_TIMEOUT)
-	{
-		INSTR_TIME_SET_CURRENT(start_time);
-		Assert(timeout >= 0 && timeout <= INT_MAX);
-		cur_timeout = timeout;
-	}
-	else
-		cur_timeout = INFINITE;
-
-	/*
-	 * Construct an array of event handles for WaitforMultipleObjects().
-	 *
-	 * Note: pgwin32_signal_event should be first to ensure that it will be
-	 * reported when multiple events are set.  We want to guarantee that
-	 * pending signals are serviced.
-	 */
-	latchevent = latch->event;
-
-	events[0] = pgwin32_signal_event;
-	events[1] = latchevent;
-	numevents = 2;
-	if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
-	{
-		/* Need an event object to represent events on the socket */
-		int			flags = FD_CLOSE;	/* always check for errors/EOF */
-
-		if (wakeEvents & WL_SOCKET_READABLE)
-			flags |= FD_READ;
-		if (wakeEvents & WL_SOCKET_WRITEABLE)
-			flags |= FD_WRITE;
-
-		sockevent = WSACreateEvent();
-		if (sockevent == WSA_INVALID_EVENT)
-			elog(ERROR, "failed to create event for socket: error code %u",
-				 WSAGetLastError());
-		if (WSAEventSelect(sock, sockevent, flags) != 0)
-			elog(ERROR, "failed to set up event for socket: error code %u",
-				 WSAGetLastError());
-
-		events[numevents++] = sockevent;
-	}
-	if (wakeEvents & WL_POSTMASTER_DEATH)
-	{
-		pmdeath_eventno = numevents;
-		events[numevents++] = PostmasterHandle;
-	}
-
-	/* Ensure that signals are serviced even if latch is already set */
-	pgwin32_dispatch_queued_signals();
-
-	do
-	{
-		/*
-		 * The comment in unix_latch.c's equivalent to this applies here as
-		 * well. At least after mentally replacing self-pipe with windows
-		 * event. There's no danger of overflowing, as "Setting an event that
-		 * is already set has no effect.".
-		 */
-		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
-		{
-			result |= WL_LATCH_SET;
-
-			/*
-			 * Leave loop immediately, avoid blocking again. We don't attempt
-			 * to report any other events that might also be satisfied.
-			 */
-			break;
-		}
-
-		rc = WaitForMultipleObjects(numevents, events, FALSE, cur_timeout);
-
-		if (rc == WAIT_FAILED)
-			elog(ERROR, "WaitForMultipleObjects() failed: error code %lu",
-				 GetLastError());
-		else if (rc == WAIT_TIMEOUT)
-		{
-			result |= WL_TIMEOUT;
-		}
-		else if (rc == WAIT_OBJECT_0)
-		{
-			/* Service newly-arrived signals */
-			pgwin32_dispatch_queued_signals();
-		}
-		else if (rc == WAIT_OBJECT_0 + 1)
-		{
-			/*
-			 * Reset the event.  We'll re-check the, potentially, set latch on
-			 * next iteration of loop, but let's not waste the cycles to
-			 * update cur_timeout below.
-			 */
-			if (!ResetEvent(latchevent))
-				elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());
-
-			continue;
-		}
-		else if ((wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) &&
-				 rc == WAIT_OBJECT_0 + 2)		/* socket is at event slot 2 */
-		{
-			WSANETWORKEVENTS resEvents;
-
-			ZeroMemory(&resEvents, sizeof(resEvents));
-			if (WSAEnumNetworkEvents(sock, sockevent, &resEvents) != 0)
-				elog(ERROR, "failed to enumerate network events: error code %u",
-					 WSAGetLastError());
-			if ((wakeEvents & WL_SOCKET_READABLE) &&
-				(resEvents.lNetworkEvents & FD_READ))
-			{
-				result |= WL_SOCKET_READABLE;
-			}
-			if ((wakeEvents & WL_SOCKET_WRITEABLE) &&
-				(resEvents.lNetworkEvents & FD_WRITE))
-			{
-				result |= WL_SOCKET_WRITEABLE;
-			}
-			if (resEvents.lNetworkEvents & FD_CLOSE)
-			{
-				if (wakeEvents & WL_SOCKET_READABLE)
-					result |= WL_SOCKET_READABLE;
-				if (wakeEvents & WL_SOCKET_WRITEABLE)
-					result |= WL_SOCKET_WRITEABLE;
-			}
-		}
-		else if ((wakeEvents & WL_POSTMASTER_DEATH) &&
-				 rc == WAIT_OBJECT_0 + pmdeath_eventno)
-		{
-			/*
-			 * Postmaster apparently died.  Since the consequences of falsely
-			 * returning WL_POSTMASTER_DEATH could be pretty unpleasant, we
-			 * take the trouble to positively verify this with
-			 * PostmasterIsAlive(), even though there is no known reason to
-			 * think that the event could be falsely set on Windows.
-			 */
-			if (!PostmasterIsAlive())
-				result |= WL_POSTMASTER_DEATH;
-		}
-		else
-			elog(ERROR, "unexpected return code from WaitForMultipleObjects(): %lu", rc);
-
-		/* If we're not done, update cur_timeout for next iteration */
-		if (result == 0 && (wakeEvents & WL_TIMEOUT))
-		{
-			INSTR_TIME_SET_CURRENT(cur_time);
-			INSTR_TIME_SUBTRACT(cur_time, start_time);
-			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
-			if (cur_timeout <= 0)
-			{
-				/* Timeout has expired, no need to continue looping */
-				result |= WL_TIMEOUT;
-			}
-		}
-	} while (result == 0);
-
-	/* Clean up the event object we created for the socket */
-	if (sockevent != WSA_INVALID_EVENT)
-	{
-		WSAEventSelect(sock, NULL, 0);
-		WSACloseEvent(sockevent);
-	}
-
-	return result;
-}
-
-/*
- * The comments above the unix implementation (unix_latch.c) of this function
- * apply here as well.
- */
-void
-SetLatch(volatile Latch *latch)
-{
-	HANDLE		handle;
-
-	/*
-	 * The memory barrier has be to be placed here to ensure that any flag
-	 * variables possibly changed by this process have been flushed to main
-	 * memory, before we check/set is_set.
-	 */
-	pg_memory_barrier();
-
-	/* Quick exit if already set */
-	if (latch->is_set)
-		return;
-
-	latch->is_set = true;
-
-	/*
-	 * See if anyone's waiting for the latch. It can be the current process if
-	 * we're in a signal handler.
-	 *
-	 * Use a local variable here just in case somebody changes the event field
-	 * concurrently (which really should not happen).
-	 */
-	handle = latch->event;
-	if (handle)
-	{
-		SetEvent(handle);
-
-		/*
-		 * Note that we silently ignore any errors. We might be in a signal
-		 * handler or other critical path where it's not safe to call elog().
-		 */
-	}
-}
-
-void
-ResetLatch(volatile Latch *latch)
-{
-	/* Only the owner should reset the latch */
-	Assert(latch->owner_pid == MyProcPid);
-
-	latch->is_set = false;
-
-	/*
-	 * Ensure that the write to is_set gets flushed to main memory before we
-	 * examine any flag variables.  Otherwise a concurrent SetLatch might
-	 * falsely conclude that it needn't signal us, even though we have missed
-	 * seeing some flag updates that SetLatch was supposed to inform us of.
-	 */
-	pg_memory_barrier();
-}
diff --git a/src/backend/storage/ipc/Makefile b/src/backend/storage/ipc/Makefile
index d8eb742..8a55392 100644
--- a/src/backend/storage/ipc/Makefile
+++ b/src/backend/storage/ipc/Makefile
@@ -8,7 +8,8 @@ subdir = src/backend/storage/ipc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = dsm_impl.o dsm.o ipc.o ipci.o pmsignal.o procarray.o procsignal.o \
-	shmem.o shmqueue.o shm_mq.o shm_toc.o sinval.o sinvaladt.o standby.o
+OBJS = dsm_impl.o dsm.o ipc.o ipci.o latch.o pmsignal.o procarray.o \
+	procsignal.o  shmem.o shmqueue.o shm_mq.o shm_toc.o sinval.o \
+	sinvaladt.o standby.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/port/unix_latch.c b/src/backend/storage/ipc/latch.c
similarity index 74%
rename from src/backend/port/unix_latch.c
rename to src/backend/storage/ipc/latch.c
index 63b76c6..865fac8 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1,6 +1,6 @@
 /*-------------------------------------------------------------------------
  *
- * unix_latch.c
+ * latch.c
  *	  Routines for inter-process latches
  *
  * The Unix implementation uses the so-called self-pipe trick to overcome
@@ -22,11 +22,14 @@
  * process, SIGUSR1 is sent and the signal handler in the waiting process
  * writes the byte to the pipe on behalf of the signaling process.
  *
+ * The Windows implementation uses Windows events that are inherited by
+ * all postmaster child processes.
+ *
  * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  * IDENTIFICATION
- *	  src/backend/port/unix_latch.c
+ *	  src/backend/storage/ipc/latch.c
  *
  *-------------------------------------------------------------------------
  */
@@ -62,16 +65,19 @@
  * useful to manually specify the used primitive.  If desired, just add a
  * define somewhere before this block.
  */
-#if defined(LATCH_USE_POLL) || defined(LATCH_USE_SELECT)
+#if defined(LATCH_USE_POLL) || defined(LATCH_USE_SELECT) || defined(LATCH_USE_WIN32)
 /* don't overwrite manual choice */
 #elif defined(HAVE_POLL)
 #define LATCH_USE_POLL
 #elif HAVE_SYS_SELECT_H
 #define LATCH_USE_SELECT
+#elif WIN32
+#define LATCH_USE_WIN32
 #else
 #error "no latch implementation available"
 #endif
 
+#ifndef WIN32
 /* Are we currently in WaitLatch? The signal handler would like to know. */
 static volatile sig_atomic_t waiting = false;
 
@@ -82,6 +88,7 @@ static int	selfpipe_writefd = -1;
 /* Private function prototypes */
 static void sendSelfPipeByte(void);
 static void drainSelfPipe(void);
+#endif   /* WIN32 */
 
 
 /*
@@ -93,6 +100,7 @@ static void drainSelfPipe(void);
 void
 InitializeLatchSupport(void)
 {
+#ifndef WIN32
 	int			pipefd[2];
 
 	Assert(selfpipe_readfd == -1);
@@ -113,6 +121,9 @@ InitializeLatchSupport(void)
 
 	selfpipe_readfd = pipefd[0];
 	selfpipe_writefd = pipefd[1];
+#else
+	/* currently, nothing to do here for Windows */
+#endif
 }
 
 /*
@@ -121,12 +132,18 @@ InitializeLatchSupport(void)
 void
 InitLatch(volatile Latch *latch)
 {
-	/* Assert InitializeLatchSupport has been called in this process */
-	Assert(selfpipe_readfd >= 0);
-
 	latch->is_set = false;
 	latch->owner_pid = MyProcPid;
 	latch->is_shared = false;
+
+#ifndef WIN32
+	/* Assert InitializeLatchSupport has been called in this process */
+	Assert(selfpipe_readfd >= 0);
+#else
+	latch->event = CreateEvent(NULL, TRUE, FALSE, NULL);
+	if (latch->event == NULL)
+		elog(ERROR, "CreateEvent failed: error code %lu", GetLastError());
+#endif   /* WIN32 */
 }
 
 /*
@@ -143,6 +160,21 @@ InitLatch(volatile Latch *latch)
 void
 InitSharedLatch(volatile Latch *latch)
 {
+#ifdef WIN32
+	SECURITY_ATTRIBUTES sa;
+
+	/*
+	 * Set up security attributes to specify that the events are inherited.
+	 */
+	ZeroMemory(&sa, sizeof(sa));
+	sa.nLength = sizeof(sa);
+	sa.bInheritHandle = TRUE;
+
+	latch->event = CreateEvent(&sa, TRUE, FALSE, NULL);
+	if (latch->event == NULL)
+		elog(ERROR, "CreateEvent failed: error code %lu", GetLastError());
+#endif
+
 	latch->is_set = false;
 	latch->owner_pid = 0;
 	latch->is_shared = true;
@@ -164,12 +196,14 @@ InitSharedLatch(volatile Latch *latch)
 void
 OwnLatch(volatile Latch *latch)
 {
-	/* Assert InitializeLatchSupport has been called in this process */
-	Assert(selfpipe_readfd >= 0);
-
+	/* Sanity checks */
 	Assert(latch->is_shared);
 
-	/* sanity check */
+#ifndef WIN32
+	/* Assert InitializeLatchSupport has been called in this process */
+	Assert(selfpipe_readfd >= 0);
+#endif
+
 	if (latch->owner_pid != 0)
 		elog(ERROR, "latch already owned");
 
@@ -221,6 +255,7 @@ WaitLatch(volatile Latch *latch, int wakeEvents, long timeout)
  * returning the socket as readable/writable or both, depending on
  * WL_SOCKET_READABLE/WL_SOCKET_WRITEABLE being specified.
  */
+#ifndef LATCH_USE_WIN32
 int
 WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 				  long timeout)
@@ -551,6 +586,198 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 
 	return result;
 }
+#else /* LATCH_USE_WIN32 */
+int
+WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
+				  long timeout)
+{
+	DWORD		rc;
+	instr_time	start_time,
+				cur_time;
+	long		cur_timeout;
+	HANDLE		events[4];
+	HANDLE		latchevent;
+	HANDLE		sockevent = WSA_INVALID_EVENT;
+	int			numevents;
+	int			result = 0;
+	int			pmdeath_eventno = 0;
+
+	Assert(wakeEvents != 0);	/* must have at least one wake event */
+
+	/* waiting for socket readiness without a socket indicates a bug */
+	if (sock == PGINVALID_SOCKET &&
+		(wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+		elog(ERROR, "cannot wait on socket events without a socket");
+
+	if ((wakeEvents & WL_LATCH_SET) && latch->owner_pid != MyProcPid)
+		elog(ERROR, "cannot wait on a latch owned by another process");
+
+	/*
+	 * Initialize timeout if requested.  We must record the current time so
+	 * that we can determine the remaining timeout if WaitForMultipleObjects
+	 * is interrupted.
+	 */
+	if (wakeEvents & WL_TIMEOUT)
+	{
+		INSTR_TIME_SET_CURRENT(start_time);
+		Assert(timeout >= 0 && timeout <= INT_MAX);
+		cur_timeout = timeout;
+	}
+	else
+		cur_timeout = INFINITE;
+
+	/*
+	 * Construct an array of event handles for WaitforMultipleObjects().
+	 *
+	 * Note: pgwin32_signal_event should be first to ensure that it will be
+	 * reported when multiple events are set.  We want to guarantee that
+	 * pending signals are serviced.
+	 */
+	latchevent = latch->event;
+
+	events[0] = pgwin32_signal_event;
+	events[1] = latchevent;
+	numevents = 2;
+	if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+	{
+		/* Need an event object to represent events on the socket */
+		int			flags = FD_CLOSE;	/* always check for errors/EOF */
+
+		if (wakeEvents & WL_SOCKET_READABLE)
+			flags |= FD_READ;
+		if (wakeEvents & WL_SOCKET_WRITEABLE)
+			flags |= FD_WRITE;
+
+		sockevent = WSACreateEvent();
+		if (sockevent == WSA_INVALID_EVENT)
+			elog(ERROR, "failed to create event for socket: error code %u",
+				 WSAGetLastError());
+		if (WSAEventSelect(sock, sockevent, flags) != 0)
+			elog(ERROR, "failed to set up event for socket: error code %u",
+				 WSAGetLastError());
+
+		events[numevents++] = sockevent;
+	}
+	if (wakeEvents & WL_POSTMASTER_DEATH)
+	{
+		pmdeath_eventno = numevents;
+		events[numevents++] = PostmasterHandle;
+	}
+
+	/* Ensure that signals are serviced even if latch is already set */
+	pgwin32_dispatch_queued_signals();
+
+	do
+	{
+		/*
+		 * Reset the event, and check if the latch is set already. If someone
+		 * sets the latch between this and the WaitForMultipleObjects() call
+		 * below, the setter will set the event and WaitForMultipleObjects()
+		 * will return immediately.
+		 */
+		if (!ResetEvent(latchevent))
+			elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());
+
+		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
+		{
+			result |= WL_LATCH_SET;
+
+			/*
+			 * Leave loop immediately, avoid blocking again. We don't attempt
+			 * to report any other events that might also be satisfied.
+			 */
+			break;
+		}
+
+		rc = WaitForMultipleObjects(numevents, events, FALSE, cur_timeout);
+
+		if (rc == WAIT_FAILED)
+			elog(ERROR, "WaitForMultipleObjects() failed: error code %lu",
+				 GetLastError());
+		else if (rc == WAIT_TIMEOUT)
+		{
+			result |= WL_TIMEOUT;
+		}
+		else if (rc == WAIT_OBJECT_0)
+		{
+			/* Service newly-arrived signals */
+			pgwin32_dispatch_queued_signals();
+		}
+		else if (rc == WAIT_OBJECT_0 + 1)
+		{
+			/*
+			 * Latch is set.  We'll handle that on next iteration of loop, but
+			 * let's not waste the cycles to update cur_timeout below.
+			 */
+			continue;
+		}
+		else if ((wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) &&
+				 rc == WAIT_OBJECT_0 + 2)		/* socket is at event slot 2 */
+		{
+			WSANETWORKEVENTS resEvents;
+
+			ZeroMemory(&resEvents, sizeof(resEvents));
+			if (WSAEnumNetworkEvents(sock, sockevent, &resEvents) != 0)
+				elog(ERROR, "failed to enumerate network events: error code %u",
+					 WSAGetLastError());
+			if ((wakeEvents & WL_SOCKET_READABLE) &&
+				(resEvents.lNetworkEvents & FD_READ))
+			{
+				result |= WL_SOCKET_READABLE;
+			}
+			if ((wakeEvents & WL_SOCKET_WRITEABLE) &&
+				(resEvents.lNetworkEvents & FD_WRITE))
+			{
+				result |= WL_SOCKET_WRITEABLE;
+			}
+			if (resEvents.lNetworkEvents & FD_CLOSE)
+			{
+				if (wakeEvents & WL_SOCKET_READABLE)
+					result |= WL_SOCKET_READABLE;
+				if (wakeEvents & WL_SOCKET_WRITEABLE)
+					result |= WL_SOCKET_WRITEABLE;
+			}
+		}
+		else if ((wakeEvents & WL_POSTMASTER_DEATH) &&
+				 rc == WAIT_OBJECT_0 + pmdeath_eventno)
+		{
+			/*
+			 * Postmaster apparently died.  Since the consequences of falsely
+			 * returning WL_POSTMASTER_DEATH could be pretty unpleasant, we
+			 * take the trouble to positively verify this with
+			 * PostmasterIsAlive(), even though there is no known reason to
+			 * think that the event could be falsely set on Windows.
+			 */
+			if (!PostmasterIsAlive())
+				result |= WL_POSTMASTER_DEATH;
+		}
+		else
+			elog(ERROR, "unexpected return code from WaitForMultipleObjects(): %lu", rc);
+
+		/* If we're not done, update cur_timeout for next iteration */
+		if (result == 0 && (wakeEvents & WL_TIMEOUT))
+		{
+			INSTR_TIME_SET_CURRENT(cur_time);
+			INSTR_TIME_SUBTRACT(cur_time, start_time);
+			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+			if (cur_timeout <= 0)
+			{
+				/* Timeout has expired, no need to continue looping */
+				result |= WL_TIMEOUT;
+			}
+		}
+	} while (result == 0);
+
+	/* Clean up the event object we created for the socket */
+	if (sockevent != WSA_INVALID_EVENT)
+	{
+		WSAEventSelect(sock, NULL, 0);
+		WSACloseEvent(sockevent);
+	}
+
+	return result;
+}
+#endif /* LATCH_USE_WIN32 */
 
 /*
  * Sets a latch and wakes up anyone waiting on it.
@@ -567,7 +794,11 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 void
 SetLatch(volatile Latch *latch)
 {
+#ifndef WIN32
 	pid_t		owner_pid;
+#else
+	HANDLE		handle;
+#endif
 
 	/*
 	 * The memory barrier has be to be placed here to ensure that any flag
@@ -582,6 +813,8 @@ SetLatch(volatile Latch *latch)
 
 	latch->is_set = true;
 
+#ifndef WIN32
+
 	/*
 	 * See if anyone's waiting for the latch. It can be the current process if
 	 * we're in a signal handler. We use the self-pipe to wake up the select()
@@ -613,6 +846,27 @@ SetLatch(volatile Latch *latch)
 	}
 	else
 		kill(owner_pid, SIGUSR1);
+#else
+
+	/*
+	 * See if anyone's waiting for the latch. It can be the current process if
+	 * we're in a signal handler.
+	 *
+	 * Use a local variable here just in case somebody changes the event field
+	 * concurrently (which really should not happen).
+	 */
+	handle = latch->event;
+	if (handle)
+	{
+		SetEvent(handle);
+
+		/*
+		 * Note that we silently ignore any errors. We might be in a signal
+		 * handler or other critical path where it's not safe to call elog().
+		 */
+	}
+#endif
+
 }
 
 /*
@@ -646,14 +900,17 @@ ResetLatch(volatile Latch *latch)
  * NB: when calling this in a signal handler, be sure to save and restore
  * errno around it.
  */
+#ifndef WIN32
 void
 latch_sigusr1_handler(void)
 {
 	if (waiting)
 		sendSelfPipeByte();
 }
+#endif   /* !WIN32 */
 
 /* Send one byte to the self-pipe, to wake up WaitLatch */
+#ifndef WIN32
 static void
 sendSelfPipeByte(void)
 {
@@ -683,6 +940,7 @@ retry:
 		return;
 	}
 }
+#endif   /* !WIN32 */
 
 /*
  * Read all available data from the self-pipe
@@ -691,6 +949,7 @@ retry:
  * return, it must reset that flag first (though ideally, this will never
  * happen).
  */
+#ifndef WIN32
 static void
 drainSelfPipe(void)
 {
@@ -729,3 +988,4 @@ drainSelfPipe(void)
 		/* else buffer wasn't big enough, so read again */
 	}
 }
+#endif   /* !WIN32 */
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 737e11d..1b9521f 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -36,7 +36,7 @@
  * WaitLatch includes a provision for timeouts (which should be avoided
  * when possible, as they incur extra overhead) and a provision for
  * postmaster child processes to wake up immediately on postmaster death.
- * See unix_latch.c for detailed specifications for the exported functions.
+ * See latch.c for detailed specifications for the exported functions.
  *
  * The correct pattern to wait for event(s) is:
  *
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 12f3bc6..e854af2 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -134,8 +134,6 @@ sub mkvcbuild
 		'src/backend/port/win32_sema.c');
 	$postgres->ReplaceFile('src/backend/port/pg_shmem.c',
 		'src/backend/port/win32_shmem.c');
-	$postgres->ReplaceFile('src/backend/port/pg_latch.c',
-		'src/backend/port/win32_latch.c');
 	$postgres->AddFiles('src/port',   @pgportfiles);
 	$postgres->AddFiles('src/common', @pgcommonbkndfiles);
 	$postgres->AddDir('src/timezone');
-- 
2.7.0.229.g701fa7f

0002-Introduce-new-WaitEventSet-API.patch (text/x-patch; charset=us-ascii)
From 1919935d0b12b5104c9294ddf7ad9126ed976451 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 18 Mar 2016 12:01:54 -0700
Subject: [PATCH 2/2] Introduce new WaitEventSet API.

Commit ac1d794 ("Make idle backends exit if the postmaster dies.")
introduced a regression on, at least, large linux systems. Constantly
adding the same postmaster_alive_fds to the OSs internal datastructures
for implementing poll/select can cause significant contention; leading
to a performance regression of nearly 3x in one example.

This can be avoided by using e.g. linux' epoll, which avoids having to
add/remove file descriptors to the wait datastructures at a high rate.
Unfortunately the current latch interface makes it hard to allocate any
persistent per-backend resources.

Replace, with a backward compatibility layer, WaitLatchOrSocket with a
new WaitEventSet API. Users can allocate such a Set across multiple
calls, and add more than one file descriptor to wait on. The latter has
been added because there are upcoming postgres features where that will be
helpful.

In addition to the previously existing poll(2), select(2),
WaitForMultipleObjects() implementations also provide an epoll_wait(2)
based implementation to address the aforementioned performance
problem. Epoll is only available on linux, but that is the most likely
OS for machines large enough (four sockets) to reproduce the problem.

To actually address the aforementioned regression, create and use a
long-lived WaitEventSet for FE/BE communication.

Reported-By: Dmitry Vasilyev
Discussion: CAB-SwXZh44_2ybvS5Z67p_CDz=XFn4hNAD=CnMEF+QqkXwFrGg@mail.gmail.com
    20160114143931.GG10941@awork2.anarazel.de
---
 configure                         |    2 +-
 configure.in                      |    2 +-
 src/backend/libpq/be-secure.c     |   24 +-
 src/backend/libpq/pqcomm.c        |    4 +
 src/backend/storage/ipc/latch.c   | 1577 +++++++++++++++++++++++++------------
 src/backend/utils/init/miscinit.c |    8 +
 src/include/libpq/libpq.h         |    3 +
 src/include/pg_config.h.in        |    3 +
 src/include/storage/latch.h       |   35 +-
 src/tools/pgindent/typedefs.list  |    2 +
 10 files changed, 1134 insertions(+), 526 deletions(-)

diff --git a/configure b/configure
index c10d954..24655dc 100755
--- a/configure
+++ b/configure
@@ -10193,7 +10193,7 @@ fi
 ## Header files
 ##
 
-for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
+for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/epoll.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
 do :
   as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh`
 ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default"
diff --git a/configure.in b/configure.in
index 47d0f58..c564a76 100644
--- a/configure.in
+++ b/configure.in
@@ -1183,7 +1183,7 @@ AC_SUBST(UUID_LIBS)
 ##
 
 dnl sys/socket.h is required by AC_FUNC_ACCEPT_ARGTYPES
-AC_CHECK_HEADERS([atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h])
+AC_CHECK_HEADERS([atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/epoll.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h])
 
 # On BSD, test for net/if.h will fail unless sys/socket.h
 # is included first.
diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c
index ac709d1..29297e7 100644
--- a/src/backend/libpq/be-secure.c
+++ b/src/backend/libpq/be-secure.c
@@ -140,13 +140,13 @@ retry:
 	/* In blocking mode, wait until the socket is ready */
 	if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))
 	{
-		int			w;
+		WaitEvent   event;
 
 		Assert(waitfor);
 
-		w = WaitLatchOrSocket(MyLatch,
-							  WL_LATCH_SET | WL_POSTMASTER_DEATH | waitfor,
-							  port->sock, 0);
+		ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);
+
+		WaitEventSetWait(FeBeWaitSet, -1 /* no timeout */, &event, 1);
 
 		/*
 		 * If the postmaster has died, it's not safe to continue running,
@@ -165,13 +165,13 @@ retry:
 		 * cycles checking for this very rare condition, and this should cause
 		 * us to exit quickly in most cases.)
 		 */
-		if (w & WL_POSTMASTER_DEATH)
+		if (event.events & WL_POSTMASTER_DEATH)
 			ereport(FATAL,
 					(errcode(ERRCODE_ADMIN_SHUTDOWN),
 					errmsg("terminating connection due to unexpected postmaster exit")));
 
 		/* Handle interrupt. */
-		if (w & WL_LATCH_SET)
+		if (event.events & WL_LATCH_SET)
 		{
 			ResetLatch(MyLatch);
 			ProcessClientReadInterrupt(true);
@@ -241,22 +241,22 @@ retry:
 
 	if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))
 	{
-		int			w;
+		WaitEvent   event;
 
 		Assert(waitfor);
 
-		w = WaitLatchOrSocket(MyLatch,
-							  WL_LATCH_SET | WL_POSTMASTER_DEATH | waitfor,
-							  port->sock, 0);
+		ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);
+
+		WaitEventSetWait(FeBeWaitSet, -1 /* no timeout */, &event, 1);
 
 		/* See comments in secure_read. */
-		if (w & WL_POSTMASTER_DEATH)
+		if (event.events & WL_POSTMASTER_DEATH)
 			ereport(FATAL,
 					(errcode(ERRCODE_ADMIN_SHUTDOWN),
 					errmsg("terminating connection due to unexpected postmaster exit")));
 
 		/* Handle interrupt. */
-		if (w & WL_LATCH_SET)
+		if (event.events & WL_LATCH_SET)
 		{
 			ResetLatch(MyLatch);
 			ProcessClientWriteInterrupt(true);
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 71473db..c81abaf 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,6 +201,10 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock, NULL);
+	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch);
+	AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL);
 }
 
 /* --------------------------------
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 865fac8..d8678c6 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -14,8 +14,8 @@
  * however reliably interrupts the sleep, and causes select() to return
  * immediately even if the signal arrives before select() begins.
  *
- * (Actually, we prefer poll() over select() where available, but the
- * same comments apply to it.)
+ * (Actually, we prefer epoll_wait() over poll() over select() where
+ * available, but the same comments apply.)
  *
  * When SetLatch is called from the same process that owns the latch,
  * SetLatch writes the byte directly to the pipe. If it's owned by another
@@ -41,6 +41,9 @@
 #include <unistd.h>
 #include <sys/time.h>
 #include <sys/types.h>
+#ifdef HAVE_SYS_EPOLL_H
+#include <sys/epoll.h>
+#endif
 #ifdef HAVE_POLL_H
 #include <poll.h>
 #endif
@@ -65,18 +68,58 @@
  * useful to manually specify the used primitive.  If desired, just add a
  * define somewhere before this block.
  */
-#if defined(LATCH_USE_POLL) || defined(LATCH_USE_SELECT) || defined(LATCH_USE_WIN32)
+#if defined(WAIT_USE_EPOLL) || defined(WAIT_USE_POLL) || defined(WAIT_USE_SELECT) || defined(WAIT_USE_WIN32)
 /* don't overwrite manual choice */
+#elif defined(HAVE_SYS_EPOLL_H)
+#define WAIT_USE_EPOLL
 #elif defined(HAVE_POLL)
-#define LATCH_USE_POLL
+#define WAIT_USE_POLL
 #elif HAVE_SYS_SELECT_H
-#define LATCH_USE_SELECT
+#define WAIT_USE_SELECT
 #elif WIN32
-#define LATCH_USE_WIN32
+#define WAIT_USE_WIN32
 #else
-#error "no latch implementation available"
+#error "no wait set implementation available"
 #endif
 
+/* typedef in latch.h */
+struct WaitEventSet
+{
+	int			nevents; /* number of registered events */
+	int			nevents_space; /* maximum number of events in this set */
+
+	/*
+	 * Array, of nevents_space length, storing the definition of events this
+	 * set is waiting for.
+	 */
+	WaitEvent  *events;
+
+	/*
+	 * If WL_LATCH_SET is specified in any wait event, latch is a pointer to
+	 * said latch, and latch_pos the offset in the ->events array. This is
+	 * useful because we check the state of the latch before performing doing
+	 * syscalls related to waiting.
+	 */
+	Latch	   *latch;
+	int			latch_pos;
+
+#if defined(WAIT_USE_EPOLL)
+	int			epoll_fd;
+	/* epoll_wait returns events in a user provided arrays, allocate once */
+	struct epoll_event *epoll_ret_events;
+#elif defined(WAIT_USE_POLL)
+	/* poll expects events to be waited on every poll() call, prepare once */
+	struct pollfd *pollfds;
+#elif defined(WAIT_USE_WIN32)
+	/*
+	 * Array of windows events. The first element always contains
+	 * pgwin32_signal_event, so the remaining elements are offset by one
+	 * (i.e. event->pos + 1).
+	 */
+	HANDLE	   *handles;
+#endif
+};
+
 #ifndef WIN32
 /* Are we currently in WaitLatch? The signal handler would like to know. */
 static volatile sig_atomic_t waiting = false;
@@ -90,6 +133,16 @@ static void sendSelfPipeByte(void);
 static void drainSelfPipe(void);
 #endif   /* WIN32 */
 
+#if defined(WAIT_USE_EPOLL)
+static void WaitEventAdjustEpoll(WaitEventSet *set, WaitEvent *event, int action);
+#elif defined(WAIT_USE_POLL)
+static void WaitEventAdjustPoll(WaitEventSet *set, WaitEvent *event);
+#elif defined(WAIT_USE_WIN32)
+static void WaitEventAdjustWin32(WaitEventSet *set, WaitEvent *event);
+#endif
+
+static int WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+					  WaitEvent *occurred_events, int nevents);
 
 /*
  * Initialize the process-local latch infrastructure.
@@ -254,530 +307,61 @@ WaitLatch(volatile Latch *latch, int wakeEvents, long timeout)
  * When waiting on a socket, EOF and error conditions are reported by
  * returning the socket as readable/writable or both, depending on
  * WL_SOCKET_READABLE/WL_SOCKET_WRITEABLE being specified.
+ *
+ * NB: These days this is just a wrapper around the WaitEventSet API. When
+ * using a latch very frequently, consider creating a longer living
+ * WaitEventSet instead; that's more efficient.
  */
-#ifndef LATCH_USE_WIN32
 int
 WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 				  long timeout)
 {
-	int			result = 0;
+	int			ret = 0;
 	int			rc;
-	instr_time	start_time,
-				cur_time;
-	long		cur_timeout;
+	WaitEvent	event;
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
 
-#if defined(LATCH_USE_POLL)
-	struct pollfd pfds[3];
-	int			nfds;
-#elif defined(LATCH_USE_SELECT)
-	struct timeval tv,
-			   *tvp;
-	fd_set		input_mask;
-	fd_set		output_mask;
-	int			hifd;
-#endif
-
-	Assert(wakeEvents != 0);	/* must have at least one wake event */
+	if (wakeEvents & WL_TIMEOUT)
+		Assert(timeout >= 0);
+	else
+		timeout = -1;
 
 	/* waiting for socket readiness without a socket indicates a bug */
 	if (sock == PGINVALID_SOCKET &&
 		(wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
 		elog(ERROR, "cannot wait on socket event without a socket");
 
-	if ((wakeEvents & WL_LATCH_SET) && latch->owner_pid != MyProcPid)
-		elog(ERROR, "cannot wait on a latch owned by another process");
+	if (wakeEvents & WL_LATCH_SET)
+		AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET,
+						  (Latch *) latch);
 
-	/*
-	 * Initialize timeout if requested.  We must record the current time so
-	 * that we can determine the remaining timeout if the poll() or select()
-	 * is interrupted.  (On some platforms, select() will update the contents
-	 * of "tv" for us, but unfortunately we can't rely on that.)
-	 */
-	if (wakeEvents & WL_TIMEOUT)
-	{
-		INSTR_TIME_SET_CURRENT(start_time);
-		Assert(timeout >= 0 && timeout <= INT_MAX);
-		cur_timeout = timeout;
+	if (wakeEvents & WL_POSTMASTER_DEATH)
+		AddWaitEventToSet(set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL);
 
-#ifdef LATCH_USE_SELECT
-		tv.tv_sec = cur_timeout / 1000L;
-		tv.tv_usec = (cur_timeout % 1000L) * 1000L;
-		tvp = &tv;
-#endif
-	}
-	else
-	{
-		cur_timeout = -1;
-
-#ifdef LATCH_USE_SELECT
-		tvp = NULL;
-#endif
-	}
-
-	waiting = true;
-	do
-	{
-		/*
-		 * Check if the latch is set already. If so, leave loop immediately,
-		 * avoid blocking again. We don't attempt to report any other events
-		 * that might also be satisfied.
-		 *
-		 * If someone sets the latch between this and the poll()/select()
-		 * below, the setter will write a byte to the pipe (or signal us and
-		 * the signal handler will do that), and the poll()/select() will
-		 * return immediately.
-		 *
-		 * If there's a pending byte in the self pipe, we'll notice whenever
-		 * blocking. Only clearing the pipe in that case avoids having to
-		 * drain it every time WaitLatchOrSocket() is used. Should the
-		 * pipe-buffer fill up we're still ok, because the pipe is in
-		 * nonblocking mode. It's unlikely for that to happen, because the
-		 * self pipe isn't filled unless we're blocking (waiting = true), or
-		 * from inside a signal handler in latch_sigusr1_handler().
-		 *
-		 * Note: we assume that the kernel calls involved in drainSelfPipe()
-		 * and SetLatch() will provide adequate synchronization on machines
-		 * with weak memory ordering, so that we cannot miss seeing is_set if
-		 * the signal byte is already in the pipe when we drain it.
-		 */
-		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
-		{
-			result |= WL_LATCH_SET;
-			break;
-		}
-
-		/*
-		 * Must wait ... we use the polling interface determined at the top of
-		 * this file to do so.
-		 */
-#if defined(LATCH_USE_POLL)
-		nfds = 0;
-
-		/* selfpipe is always in pfds[0] */
-		pfds[0].fd = selfpipe_readfd;
-		pfds[0].events = POLLIN;
-		pfds[0].revents = 0;
-		nfds++;
-
-		if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
-		{
-			/* socket, if used, is always in pfds[1] */
-			pfds[1].fd = sock;
-			pfds[1].events = 0;
-			if (wakeEvents & WL_SOCKET_READABLE)
-				pfds[1].events |= POLLIN;
-			if (wakeEvents & WL_SOCKET_WRITEABLE)
-				pfds[1].events |= POLLOUT;
-			pfds[1].revents = 0;
-			nfds++;
-		}
-
-		if (wakeEvents & WL_POSTMASTER_DEATH)
-		{
-			/* postmaster fd, if used, is always in pfds[nfds - 1] */
-			pfds[nfds].fd = postmaster_alive_fds[POSTMASTER_FD_WATCH];
-			pfds[nfds].events = POLLIN;
-			pfds[nfds].revents = 0;
-			nfds++;
-		}
-
-		/* Sleep */
-		rc = poll(pfds, nfds, (int) cur_timeout);
-
-		/* Check return code */
-		if (rc < 0)
-		{
-			/* EINTR is okay, otherwise complain */
-			if (errno != EINTR)
-			{
-				waiting = false;
-				ereport(ERROR,
-						(errcode_for_socket_access(),
-						 errmsg("poll() failed: %m")));
-			}
-		}
-		else if (rc == 0)
-		{
-			/* timeout exceeded */
-			if (wakeEvents & WL_TIMEOUT)
-				result |= WL_TIMEOUT;
-		}
-		else
-		{
-			/* at least one event occurred, so check revents values */
-
-			if (pfds[0].revents & POLLIN)
-			{
-				/* There's data in the self-pipe, clear it. */
-				drainSelfPipe();
-			}
-
-			if ((wakeEvents & WL_SOCKET_READABLE) &&
-				(pfds[1].revents & POLLIN))
-			{
-				/* data available in socket, or EOF/error condition */
-				result |= WL_SOCKET_READABLE;
-			}
-			if ((wakeEvents & WL_SOCKET_WRITEABLE) &&
-				(pfds[1].revents & POLLOUT))
-			{
-				/* socket is writable */
-				result |= WL_SOCKET_WRITEABLE;
-			}
-			if ((wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) &&
-				(pfds[1].revents & (POLLHUP | POLLERR | POLLNVAL)))
-			{
-				/* EOF/error condition */
-				if (wakeEvents & WL_SOCKET_READABLE)
-					result |= WL_SOCKET_READABLE;
-				if (wakeEvents & WL_SOCKET_WRITEABLE)
-					result |= WL_SOCKET_WRITEABLE;
-			}
-
-			/*
-			 * We expect a POLLHUP when the remote end is closed, but because
-			 * we don't expect the pipe to become readable or to have any
-			 * errors either, treat those cases as postmaster death, too.
-			 */
-			if ((wakeEvents & WL_POSTMASTER_DEATH) &&
-				(pfds[nfds - 1].revents & (POLLHUP | POLLIN | POLLERR | POLLNVAL)))
-			{
-				/*
-				 * According to the select(2) man page on Linux, select(2) may
-				 * spuriously return and report a file descriptor as readable,
-				 * when it's not; and presumably so can poll(2).  It's not
-				 * clear that the relevant cases would ever apply to the
-				 * postmaster pipe, but since the consequences of falsely
-				 * returning WL_POSTMASTER_DEATH could be pretty unpleasant,
-				 * we take the trouble to positively verify EOF with
-				 * PostmasterIsAlive().
-				 */
-				if (!PostmasterIsAlive())
-					result |= WL_POSTMASTER_DEATH;
-			}
-		}
-#elif defined(LATCH_USE_SELECT)
-
-		/*
-		 * On at least older linux kernels select(), in violation of POSIX,
-		 * doesn't reliably return a socket as writable if closed - but we
-		 * rely on that. So far all the known cases of this problem are on
-		 * platforms that also provide a poll() implementation without that
-		 * bug.  If we find one where that's not the case, we'll need to add a
-		 * workaround.
-		 */
-		FD_ZERO(&input_mask);
-		FD_ZERO(&output_mask);
-
-		FD_SET(selfpipe_readfd, &input_mask);
-		hifd = selfpipe_readfd;
-
-		if (wakeEvents & WL_POSTMASTER_DEATH)
-		{
-			FD_SET(postmaster_alive_fds[POSTMASTER_FD_WATCH], &input_mask);
-			if (postmaster_alive_fds[POSTMASTER_FD_WATCH] > hifd)
-				hifd = postmaster_alive_fds[POSTMASTER_FD_WATCH];
-		}
-
-		if (wakeEvents & WL_SOCKET_READABLE)
-		{
-			FD_SET(sock, &input_mask);
-			if (sock > hifd)
-				hifd = sock;
-		}
-
-		if (wakeEvents & WL_SOCKET_WRITEABLE)
-		{
-			FD_SET(sock, &output_mask);
-			if (sock > hifd)
-				hifd = sock;
-		}
-
-		/* Sleep */
-		rc = select(hifd + 1, &input_mask, &output_mask, NULL, tvp);
-
-		/* Check return code */
-		if (rc < 0)
-		{
-			/* EINTR is okay, otherwise complain */
-			if (errno != EINTR)
-			{
-				waiting = false;
-				ereport(ERROR,
-						(errcode_for_socket_access(),
-						 errmsg("select() failed: %m")));
-			}
-		}
-		else if (rc == 0)
-		{
-			/* timeout exceeded */
-			if (wakeEvents & WL_TIMEOUT)
-				result |= WL_TIMEOUT;
-		}
-		else
-		{
-			/* at least one event occurred, so check masks */
-			if (FD_ISSET(selfpipe_readfd, &input_mask))
-			{
-				/* There's data in the self-pipe, clear it. */
-				drainSelfPipe();
-			}
-			if ((wakeEvents & WL_SOCKET_READABLE) && FD_ISSET(sock, &input_mask))
-			{
-				/* data available in socket, or EOF */
-				result |= WL_SOCKET_READABLE;
-			}
-			if ((wakeEvents & WL_SOCKET_WRITEABLE) && FD_ISSET(sock, &output_mask))
-			{
-				/* socket is writable, or EOF */
-				result |= WL_SOCKET_WRITEABLE;
-			}
-			if ((wakeEvents & WL_POSTMASTER_DEATH) &&
-				FD_ISSET(postmaster_alive_fds[POSTMASTER_FD_WATCH],
-						 &input_mask))
-			{
-				/*
-				 * According to the select(2) man page on Linux, select(2) may
-				 * spuriously return and report a file descriptor as readable,
-				 * when it's not; and presumably so can poll(2).  It's not
-				 * clear that the relevant cases would ever apply to the
-				 * postmaster pipe, but since the consequences of falsely
-				 * returning WL_POSTMASTER_DEATH could be pretty unpleasant,
-				 * we take the trouble to positively verify EOF with
-				 * PostmasterIsAlive().
-				 */
-				if (!PostmasterIsAlive())
-					result |= WL_POSTMASTER_DEATH;
-			}
-		}
-#endif   /* LATCH_USE_SELECT */
-
-		/*
-		 * Check again whether latch is set, the arrival of a signal/self-byte
-		 * might be what stopped our sleep. It's not required for correctness
-		 * to signal the latch as being set (we'd just loop if there's no
-		 * other event), but it seems good to report an arrived latch asap.
-		 * This way we also don't have to compute the current timestamp again.
-		 */
-		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
-			result |= WL_LATCH_SET;
-
-		/* If we're not done, update cur_timeout for next iteration */
-		if (result == 0 && (wakeEvents & WL_TIMEOUT))
-		{
-			INSTR_TIME_SET_CURRENT(cur_time);
-			INSTR_TIME_SUBTRACT(cur_time, start_time);
-			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
-			if (cur_timeout <= 0)
-			{
-				/* Timeout has expired, no need to continue looping */
-				result |= WL_TIMEOUT;
-			}
-#ifdef LATCH_USE_SELECT
-			else
-			{
-				tv.tv_sec = cur_timeout / 1000L;
-				tv.tv_usec = (cur_timeout % 1000L) * 1000L;
-			}
-#endif
-		}
-	} while (result == 0);
-	waiting = false;
-
-	return result;
-}
-#else /* LATCH_USE_WIN32 */
-int
-WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
-				  long timeout)
-{
-	DWORD		rc;
-	instr_time	start_time,
-				cur_time;
-	long		cur_timeout;
-	HANDLE		events[4];
-	HANDLE		latchevent;
-	HANDLE		sockevent = WSA_INVALID_EVENT;
-	int			numevents;
-	int			result = 0;
-	int			pmdeath_eventno = 0;
-
-	Assert(wakeEvents != 0);	/* must have at least one wake event */
-
-	/* waiting for socket readiness without a socket indicates a bug */
-	if (sock == PGINVALID_SOCKET &&
-		(wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
-		elog(ERROR, "cannot wait on socket events without a socket");
-
-	if ((wakeEvents & WL_LATCH_SET) && latch->owner_pid != MyProcPid)
-		elog(ERROR, "cannot wait on a latch owned by another process");
-
-	/*
-	 * Initialize timeout if requested.  We must record the current time so
-	 * that we can determine the remaining timeout if WaitForMultipleObjects
-	 * is interrupted.
-	 */
-	if (wakeEvents & WL_TIMEOUT)
-	{
-		INSTR_TIME_SET_CURRENT(start_time);
-		Assert(timeout >= 0 && timeout <= INT_MAX);
-		cur_timeout = timeout;
-	}
-	else
-		cur_timeout = INFINITE;
-
-	/*
-	 * Construct an array of event handles for WaitforMultipleObjects().
-	 *
-	 * Note: pgwin32_signal_event should be first to ensure that it will be
-	 * reported when multiple events are set.  We want to guarantee that
-	 * pending signals are serviced.
-	 */
-	latchevent = latch->event;
-
-	events[0] = pgwin32_signal_event;
-	events[1] = latchevent;
-	numevents = 2;
 	if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
 	{
-		/* Need an event object to represent events on the socket */
-		int			flags = FD_CLOSE;	/* always check for errors/EOF */
+		int			ev;
 
-		if (wakeEvents & WL_SOCKET_READABLE)
-			flags |= FD_READ;
-		if (wakeEvents & WL_SOCKET_WRITEABLE)
-			flags |= FD_WRITE;
-
-		sockevent = WSACreateEvent();
-		if (sockevent == WSA_INVALID_EVENT)
-			elog(ERROR, "failed to create event for socket: error code %u",
-				 WSAGetLastError());
-		if (WSAEventSelect(sock, sockevent, flags) != 0)
-			elog(ERROR, "failed to set up event for socket: error code %u",
-				 WSAGetLastError());
-
-		events[numevents++] = sockevent;
-	}
-	if (wakeEvents & WL_POSTMASTER_DEATH)
-	{
-		pmdeath_eventno = numevents;
-		events[numevents++] = PostmasterHandle;
+		ev = wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);
+		AddWaitEventToSet(set, ev, sock, NULL);
 	}
 
-	/* Ensure that signals are serviced even if latch is already set */
-	pgwin32_dispatch_queued_signals();
+	rc = WaitEventSetWait(set, timeout, &event, 1);
 
-	do
+	if (rc == 0)
+		ret |= WL_TIMEOUT;
+	else
 	{
-		/*
-		 * Reset the event, and check if the latch is set already. If someone
-		 * sets the latch between this and the WaitForMultipleObjects() call
-		 * below, the setter will set the event and WaitForMultipleObjects()
-		 * will return immediately.
-		 */
-		if (!ResetEvent(latchevent))
-			elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());
-
-		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
-		{
-			result |= WL_LATCH_SET;
-
-			/*
-			 * Leave loop immediately, avoid blocking again. We don't attempt
-			 * to report any other events that might also be satisfied.
-			 */
-			break;
-		}
-
-		rc = WaitForMultipleObjects(numevents, events, FALSE, cur_timeout);
-
-		if (rc == WAIT_FAILED)
-			elog(ERROR, "WaitForMultipleObjects() failed: error code %lu",
-				 GetLastError());
-		else if (rc == WAIT_TIMEOUT)
-		{
-			result |= WL_TIMEOUT;
-		}
-		else if (rc == WAIT_OBJECT_0)
-		{
-			/* Service newly-arrived signals */
-			pgwin32_dispatch_queued_signals();
-		}
-		else if (rc == WAIT_OBJECT_0 + 1)
-		{
-			/*
-			 * Latch is set.  We'll handle that on next iteration of loop, but
-			 * let's not waste the cycles to update cur_timeout below.
-			 */
-			continue;
-		}
-		else if ((wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) &&
-				 rc == WAIT_OBJECT_0 + 2)		/* socket is at event slot 2 */
-		{
-			WSANETWORKEVENTS resEvents;
-
-			ZeroMemory(&resEvents, sizeof(resEvents));
-			if (WSAEnumNetworkEvents(sock, sockevent, &resEvents) != 0)
-				elog(ERROR, "failed to enumerate network events: error code %u",
-					 WSAGetLastError());
-			if ((wakeEvents & WL_SOCKET_READABLE) &&
-				(resEvents.lNetworkEvents & FD_READ))
-			{
-				result |= WL_SOCKET_READABLE;
-			}
-			if ((wakeEvents & WL_SOCKET_WRITEABLE) &&
-				(resEvents.lNetworkEvents & FD_WRITE))
-			{
-				result |= WL_SOCKET_WRITEABLE;
-			}
-			if (resEvents.lNetworkEvents & FD_CLOSE)
-			{
-				if (wakeEvents & WL_SOCKET_READABLE)
-					result |= WL_SOCKET_READABLE;
-				if (wakeEvents & WL_SOCKET_WRITEABLE)
-					result |= WL_SOCKET_WRITEABLE;
-			}
-		}
-		else if ((wakeEvents & WL_POSTMASTER_DEATH) &&
-				 rc == WAIT_OBJECT_0 + pmdeath_eventno)
-		{
-			/*
-			 * Postmaster apparently died.  Since the consequences of falsely
-			 * returning WL_POSTMASTER_DEATH could be pretty unpleasant, we
-			 * take the trouble to positively verify this with
-			 * PostmasterIsAlive(), even though there is no known reason to
-			 * think that the event could be falsely set on Windows.
-			 */
-			if (!PostmasterIsAlive())
-				result |= WL_POSTMASTER_DEATH;
-		}
-		else
-			elog(ERROR, "unexpected return code from WaitForMultipleObjects(): %lu", rc);
-
-		/* If we're not done, update cur_timeout for next iteration */
-		if (result == 0 && (wakeEvents & WL_TIMEOUT))
-		{
-			INSTR_TIME_SET_CURRENT(cur_time);
-			INSTR_TIME_SUBTRACT(cur_time, start_time);
-			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
-			if (cur_timeout <= 0)
-			{
-				/* Timeout has expired, no need to continue looping */
-				result |= WL_TIMEOUT;
-			}
-		}
-	} while (result == 0);
-
-	/* Clean up the event object we created for the socket */
-	if (sockevent != WSA_INVALID_EVENT)
-	{
-		WSAEventSelect(sock, NULL, 0);
-		WSACloseEvent(sockevent);
+		ret |= event.events & (WL_LATCH_SET |
+							   WL_POSTMASTER_DEATH |
+							   WL_SOCKET_READABLE |
+							   WL_SOCKET_WRITEABLE);
 	}
 
-	return result;
+	FreeWaitEventSet(set);
+
+	return ret;
 }
-#endif /* LATCH_USE_WIN32 */
 
 /*
  * Sets a latch and wakes up anyone waiting on it.
@@ -814,7 +398,6 @@ SetLatch(volatile Latch *latch)
 	latch->is_set = true;
 
 #ifndef WIN32
-
 	/*
 	 * See if anyone's waiting for the latch. It can be the current process if
 	 * we're in a signal handler. We use the self-pipe to wake up the select()
@@ -891,6 +474,986 @@ ResetLatch(volatile Latch *latch)
 }
 
 /*
+ * Create a WaitEventSet with space for nevents different events to wait for.
+ *
+ * These events can then be efficiently waited upon together, using
+ * WaitEventSetWait().
+ */
+WaitEventSet *
+CreateWaitEventSet(MemoryContext context, int nevents)
+{
+	WaitEventSet *set;
+	char	   *data;
+	Size		sz = 0;
+
+	sz += sizeof(WaitEventSet);
+	sz += sizeof(WaitEvent) * nevents;
+
+#if defined(WAIT_USE_EPOLL)
+	sz += sizeof(struct epoll_event) * nevents;
+#elif defined(WAIT_USE_POLL)
+	sz += sizeof(struct pollfd) * nevents;
+#elif defined(WAIT_USE_WIN32)
+	/* need space for the pgwin32_signal_event */
+	sz += sizeof(HANDLE) * (nevents + 1);
+#endif
+
+	data = (char *) MemoryContextAllocZero(context, sz);
+
+	set = (WaitEventSet *) data;
+	data += sizeof(WaitEventSet);
+
+	set->events = (WaitEvent *) data;
+	data += sizeof(WaitEvent) * nevents;
+
+#if defined(WAIT_USE_EPOLL)
+	set->epoll_ret_events = (struct epoll_event *) data;
+	data += sizeof(struct epoll_event) * nevents;
+#elif defined(WAIT_USE_POLL)
+	set->pollfds = (struct pollfd *) data;
+	data += sizeof(struct pollfd) * nevents;
+#elif defined(WAIT_USE_WIN32)
+	set->handles = (HANDLE) data;
+	data += sizeof(HANDLE) * nevents;
+#endif
+
+	set->latch = NULL;
+	set->nevents_space = nevents;
+
+#if defined(WAIT_USE_EPOLL)
+	set->epoll_fd = epoll_create(nevents);
+	if (set->epoll_fd < 0)
+		elog(ERROR, "epoll_create failed: %m");
+#elif defined(WAIT_USE_WIN32)
+
+	/*
+	 * To handle signals while waiting, we need to add a win32 specific event.
+	 * We accounted for the additional event at the top of this routine. See
+	 * port/win32/signal.c for more details.
+	 *
+	 * Note: pgwin32_signal_event should be first to ensure that it will be
+	 * reported when multiple events are set.  We want to guarantee that
+	 * pending signals are serviced.
+	 */
+	set->handles[0] = pgwin32_signal_event;
+#endif
+
+	return set;
+}
+
+/*
+ * Free a previously created WaitEventSet.
+ */
+void
+FreeWaitEventSet(WaitEventSet *set)
+{
+#if defined(WAIT_USE_EPOLL)
+	close(set->epoll_fd);
+#elif defined(WAIT_USE_WIN32)
+	WaitEvent  *cur_event;
+
+	for (cur_event = set->events;
+		 cur_event < (set->events + set->nevents);
+		 cur_event++)
+	{
+		if (cur_event->events & WL_LATCH_SET)
+		{
+			/* uses the latch's HANDLE */
+		}
+		else if (cur_event->events & WL_POSTMASTER_DEATH)
+		{
+			/* uses PostmasterHandle */
+		}
+		else
+		{
+			/* Clean up the event object we created for the socket */
+			WSAEventSelect(cur_event->fd, NULL, 0);
+			WSACloseEvent(set->handles[cur_event->pos + 1]);
+		}
+	}
+#endif
+
+	pfree(set);
+}
+
+/* ---
+ * Add an event to the set. Possible events are:
+ * - WL_LATCH_SET: Wait for the latch to be set
+ * - WL_POSTMASTER_DEATH: Wait for postmaster to die
+ * - WL_SOCKET_READABLE: Wait for socket to become readable
+ *	 can be combined in one event with WL_SOCKET_WRITEABLE
+ * - WL_SOCKET_WRITEABLE: Wait for socket to become writeable
+ *	 can be combined with WL_SOCKET_READABLE
+ *
+ * Returns the offset in WaitEventSet->events (starting from 0), which can be
+ * used to modify previously added wait events using ModifyWaitEvent().
+ *
+ * In the WL_LATCH_SET case the latch must be owned by the current process,
+ * i.e. it must be a backend-local latch initialized with InitLatch, or a
+ * shared latch associated with the current process by calling OwnLatch.
+ *
+ * In the WL_SOCKET_READABLE/WRITEABLE case, EOF and error conditions are
+ * reported by returning the socket as readable/writable or both, depending on
+ * WL_SOCKET_READABLE/WRITEABLE being specified.
+ */
+int
+AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd, Latch *latch)
+{
+	WaitEvent  *event;
+
+	/* not enough space */
+	Assert(set->nevents < set->nevents_space);
+
+	if (set->latch && latch)
+		elog(ERROR, "cannot wait on more than one latch");
+
+	if (latch == NULL && (events & WL_LATCH_SET))
+		elog(ERROR, "cannot wait on latch without a specified latch");
+
+	/* waiting for socket readiness without a socket indicates a bug */
+	if (fd == PGINVALID_SOCKET &&
+		(events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)))
+		elog(ERROR, "cannot wait on socket events without a socket");
+
+	event = &set->events[set->nevents];
+	event->pos = set->nevents++;
+	event->fd = fd;
+	event->events = events;
+
+	if (events == WL_LATCH_SET)
+	{
+		set->latch = latch;
+		set->latch_pos = event->pos;
+#ifndef WIN32
+		event->fd = selfpipe_readfd;
+#endif
+	}
+	else if (events == WL_POSTMASTER_DEATH)
+	{
+#ifndef WIN32
+		event->fd = postmaster_alive_fds[POSTMASTER_FD_WATCH];
+#endif
+	}
+
+	/* perform wait primitive specific initialization, if needed */
+#if defined(WAIT_USE_EPOLL)
+	WaitEventAdjustEpoll(set, event, EPOLL_CTL_ADD);
+#elif defined(WAIT_USE_POLL)
+	WaitEventAdjustPoll(set, event);
+#elif defined(WAIT_USE_SELECT)
+	/* nothing to do */
+#elif defined(WAIT_USE_WIN32)
+	WaitEventAdjustWin32(set, event);
+#endif
+
+	return event->pos;
+}
+
+/*
+ * Change the event mask and, in the WL_LATCH_SET case, the latch associated
+ * with the WaitEvent.
+ *
+ * 'pos' is the id returned by AddWaitEventToSet.
+ */
+void
+ModifyWaitEvent(WaitEventSet *set, int pos, uint32 events, Latch *latch)
+{
+	WaitEvent  *event;
+
+	Assert(pos < set->nevents);
+
+	event = &set->events[pos];
+
+	/*
+	 * If neither the event mask nor the associated latch changes, return
+	 * early. That's an important optimization for some sockets, were
+	 * ModifyWaitEvent is frequently used to switch from waiting for reads to
+	 * waiting on writes.
+	 */
+	if (events == event->events &&
+		(!(event->events & WL_LATCH_SET) || set->latch == latch))
+		return;
+
+	if (event->events & WL_LATCH_SET &&
+		events != event->events)
+	{
+		/* we could allow to disable latch events for a while */
+		elog(ERROR, "cannot modify latch event");
+	}
+
+	if (event->events & WL_POSTMASTER_DEATH)
+	{
+		elog(ERROR, "cannot modify postmaster death event");
+	}
+
+	/* FIXME: validate event mask */
+	event->events = events;
+
+	if (events == WL_LATCH_SET)
+	{
+		set->latch = latch;
+	}
+
+#if defined(WAIT_USE_EPOLL)
+	WaitEventAdjustEpoll(set, event, EPOLL_CTL_MOD);
+#elif defined(WAIT_USE_POLL)
+	WaitEventAdjustPoll(set, event);
+#elif defined(WAIT_USE_SELECT)
+	/* nothing to do */
+#elif defined(WAIT_USE_WIN32)
+	WaitEventAdjustWin32(set, event);
+#endif
+}
+
+/*
+ * Wait for events added to the set to happen, or until the timeout is
+ * reached.  At most nevents occurred events are returned.
+ *
+ * If timeout = -1, block until an event occurs; if 0, check sockets for
+ * readiness, but don't block; if > 0, block for at most timeout milliseconds.
+ *
+ * Returns the number of events occurred, or 0 if the timeout was reached.
+ */
+int
+WaitEventSetWait(WaitEventSet *set, long timeout,
+				 WaitEvent *occurred_events, int nevents)
+{
+	int			returned_events = 0;
+	instr_time	start_time;
+	instr_time	cur_time;
+	long		cur_timeout = -1;
+
+	Assert(nevents > 0);
+
+	/*
+	 * Initialize timeout if requested.  We must record the current time so
+	 * that we can determine the remaining timeout if interrupted.
+	 */
+	if (timeout >= 0)
+	{
+		INSTR_TIME_SET_CURRENT(start_time);
+		Assert(timeout >= 0 && timeout <= INT_MAX);
+		cur_timeout = timeout;
+	}
+
+#ifndef WIN32
+	waiting = true;
+#else
+	/* Ensure that signals are serviced even if latch is already set */
+	pgwin32_dispatch_queued_signals();
+#endif
+	while (returned_events == 0)
+	{
+		int			rc;
+
+		/*
+		 * Check if the latch is set already. If so, leave the loop
+		 * immediately, avoid blocking again. We don't attempt to report any
+		 * other events that might also be satisfied.
+		 *
+		 * If someone sets the latch between this and the
+		 * WaitEventSetWaitBlock() below, the setter will write a byte to the
+		 * pipe (or signal us and the signal handler will do that), and the
+		 * readiness routine will return immediately.
+		 *
+		 * On unix, if there's a pending byte in the self pipe, we'll notice
+		 * whenever blocking. Only clearing the pipe in that case avoids
+		 * having to drain it every time WaitLatchOrSocket() is used. Should
+		 * the pipe-buffer fill up we're still ok, because the pipe is in
+		 * nonblocking mode. It's unlikely for that to happen, because the
+		 * self pipe isn't filled unless we're blocking (waiting = true), or
+		 * from inside a signal handler in latch_sigusr1_handler().
+		 *
+		 * On windows, we'll also notice if there's a pending event for the
+		 * latch when blocking, but there's no danger of anything filling up,
+		 * as "Setting an event that is already set has no effect.".
+		 *
+		 * Note: we assume that the kernel calls involved in latch management
+		 * will provide adequate synchronization on machines with weak memory
+		 * ordering, so that we cannot miss seeing is_set if a notification
+		 * has already been queued.
+		 */
+		if (set->latch && set->latch->is_set)
+		{
+			occurred_events->fd = PGINVALID_SOCKET;
+			occurred_events->pos = set->latch_pos;
+			occurred_events->events = WL_LATCH_SET;
+			occurred_events++;
+			returned_events++;
+
+			break;
+		}
+
+		/*
+		 * Wait for events using the readiness primitive chosen at the top of
+		 * this file. If -1 is returned, a timeout has occurred, if 0 we have
+		 * to retry, everything >= 1 is the number of returned events.
+		 */
+		rc = WaitEventSetWaitBlock(set, cur_timeout,
+								   occurred_events, nevents);
+
+		if (rc == -1)
+			break;				/* timeout occurred */
+		else
+			returned_events = rc;
+
+		/* If we're not done, update cur_timeout for next iteration */
+		if (returned_events == 0 && timeout >= 0)
+		{
+			INSTR_TIME_SET_CURRENT(cur_time);
+			INSTR_TIME_SUBTRACT(cur_time, start_time);
+			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+			if (cur_timeout <= 0)
+				break;
+		}
+	}
+#ifndef WIN32
+	waiting = false;
+#endif
+
+	return returned_events;
+}
+
+#if defined(WAIT_USE_EPOLL)
+/*
+ * action can be one of EPOLL_CTL_ADD | EPOLL_CTL_MOD | EPOLL_CTL_DEL
+ */
+static void
+WaitEventAdjustEpoll(WaitEventSet *set, WaitEvent *event, int action)
+{
+	struct epoll_event epoll_ev;
+	int			rc;
+
+	/* pointer to our event, returned by epoll_wait */
+	epoll_ev.data.ptr = event;
+	/* always wait for errors */
+	epoll_ev.events = EPOLLERR | EPOLLHUP;
+
+	/* prepare pollfd entry once */
+	if (event->events == WL_LATCH_SET)
+	{
+		Assert(set->latch != NULL);
+		epoll_ev.events |= EPOLLIN;
+	}
+	else if (event->events == WL_POSTMASTER_DEATH)
+	{
+		epoll_ev.events |= EPOLLIN;
+	}
+	else
+	{
+		Assert(event->fd >= 0);
+		Assert(event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE));
+
+		if (event->events & WL_SOCKET_READABLE)
+			epoll_ev.events |= EPOLLIN;
+		if (event->events & WL_SOCKET_WRITEABLE)
+			epoll_ev.events |= EPOLLOUT;
+	}
+
+	/*
+	 * Even though unused, we also poss epoll_ev as the data argument if
+	 * EPOLL_CTL_DELETE is passed as action.  There used to be an epoll bug
+	 * requiring that, and acutally it makes the code simpler...
+	 */
+	rc = epoll_ctl(set->epoll_fd, action, event->fd, &epoll_ev);
+
+	if (rc < 0)
+		ereport(ERROR,
+				(errcode_for_socket_access(),
+				 errmsg("epoll_ctl() failed: %m")));
+}
+#endif
+
+#if defined(WAIT_USE_POLL)
+static void
+WaitEventAdjustPoll(WaitEventSet *set, WaitEvent *event)
+{
+	struct pollfd *pollfd = &set->pollfds[event->pos];
+
+	pollfd->revents = 0;
+	pollfd->fd = event->fd;
+
+	/* prepare pollfd entry once */
+	if (event->events == WL_LATCH_SET)
+	{
+		Assert(set->latch != NULL);
+		pollfd->events = POLLIN;
+	}
+	else if (event->events == WL_POSTMASTER_DEATH)
+	{
+		pollfd->events = POLLIN;
+	}
+	else
+	{
+		Assert(event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE));
+		pollfd->events = 0;
+		if (event->events & WL_SOCKET_READABLE)
+			pollfd->events |= POLLIN;
+		if (event->events & WL_SOCKET_WRITEABLE)
+			pollfd->events |= POLLOUT;
+	}
+
+	Assert(event->fd >= 0);
+}
+#endif
+
+#if defined(WAIT_USE_WIN32)
+static void
+WaitEventAdjustWin32(WaitEventSet *set, WaitEvent *event)
+{
+	HANDLE	   *handle = &set->handles[event->pos + 1];
+
+	if (event->events == WL_LATCH_SET)
+	{
+		Assert(set->latch != NULL);
+		*handle = set->latch->event;
+	}
+	else if (event->events == WL_POSTMASTER_DEATH)
+	{
+		*handle = PostmasterHandle;
+	}
+	else
+	{
+		int			flags = FD_CLOSE;	/* always check for errors/EOF */
+
+		if (event->events & WL_SOCKET_READABLE)
+			flags |= FD_READ;
+		if (event->events & WL_SOCKET_WRITEABLE)
+			flags |= FD_WRITE;
+
+		if (*handle == WSA_INVALID_EVENT)
+		{
+			*handle = WSACreateEvent();
+			if (*handle == WSA_INVALID_EVENT)
+				elog(ERROR, "failed to create event for socket: error code %u",
+					 WSAGetLastError());
+		}
+		if (WSAEventSelect(event->fd, *handle, flags) != 0)
+			elog(ERROR, "failed to set up event for socket: error code %u",
+				 WSAGetLastError());
+
+		Assert(event->fd >= 0);
+	}
+}
+#endif
+
+
+#if defined(WAIT_USE_EPOLL)
+
+/*
+ * Wait using linux' epoll_wait(2).
+ *
+ * This is the preferable wait method, as several readiness notifications are
+ * delivered, without having to iterate through all of set->events. The
+ * returned epoll_event structs contain a pointer to our events, making
+ * association easy.
+ */
+static int
+WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+					  WaitEvent *occurred_events, int nevents)
+{
+	int			returned_events = 0;
+	int			rc;
+	WaitEvent  *cur_event;
+	struct epoll_event *cur_epoll_event;
+
+	/* Sleep */
+	rc = epoll_wait(set->epoll_fd, set->epoll_ret_events,
+					nevents, cur_timeout);
+
+	/* Check return code */
+	if (rc < 0)
+	{
+		/* EINTR is okay, otherwise complain */
+		if (errno != EINTR)
+		{
+			waiting = false;
+			ereport(ERROR,
+					(errcode_for_socket_access(),
+					 errmsg("epoll_wait() failed: %m")));
+		}
+		return 0;
+	}
+	else if (rc == 0)
+	{
+		/* timeout exceeded */
+		return -1;
+	}
+
+	/*
+	 * At least one event occurred, iterate over the returned epoll events
+	 * until they're either all processed, or we've returned all the events
+	 * the caller desired.
+	 */
+	for (cur_epoll_event = set->epoll_ret_events;
+		 cur_epoll_event < (set->epoll_ret_events + rc) &&
+		 returned_events < nevents;
+		 cur_epoll_event++)
+	{
+		/* epoll's data pointer is set to the associated WaitEvent */
+		cur_event = (WaitEvent *) cur_epoll_event->data.ptr;
+
+		occurred_events->pos = cur_event->pos;
+		occurred_events->events = 0;
+
+		if (cur_event->events == WL_LATCH_SET &&
+			cur_epoll_event->events & (EPOLLIN | EPOLLERR | EPOLLHUP))
+		{
+			/* There's data in the self-pipe, clear it. */
+			drainSelfPipe();
+
+			if (set->latch->is_set)
+			{
+				occurred_events->fd = PGINVALID_SOCKET;
+				occurred_events->events = WL_LATCH_SET;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+		else if (cur_event->events == WL_POSTMASTER_DEATH &&
+				 cur_epoll_event->events & (EPOLLIN | EPOLLERR | EPOLLHUP))
+		{
+			/*
+			 * We expect an EPOLLHUP when the remote end is closed, but
+			 * because we don't expect the pipe to become readable or to have
+			 * any errors either, treat those cases as postmaster death, too.
+			 *
+			 * As explained in the WAIT_USE_SELECT implementation, select(2)
+			 * may spuriously return. Be paranoid about that here too, a
+			 * spurious WL_POSTMASTER_DEATH would be painful.
+			 */
+			if (!PostmasterIsAlive())
+			{
+				occurred_events->fd = PGINVALID_SOCKET;
+				occurred_events->events = WL_POSTMASTER_DEATH;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+		else if (cur_event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+		{
+			Assert(cur_event->fd >= 0);
+
+			if ((cur_event->events & WL_SOCKET_READABLE) &&
+				(cur_epoll_event->events & (EPOLLIN | EPOLLERR | EPOLLHUP)))
+			{
+				/* readable, or EOF */
+				occurred_events->events |= WL_SOCKET_READABLE;
+			}
+
+			if ((cur_event->events & WL_SOCKET_WRITEABLE) &&
+				(cur_epoll_event->events & (EPOLLOUT | EPOLLERR | EPOLLHUP)))
+			{
+				/* writable, or EOF */
+				occurred_events->events |= WL_SOCKET_WRITEABLE;
+			}
+
+			if (occurred_events->events != 0)
+			{
+				occurred_events->fd = cur_event->fd;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+	}
+
+	return returned_events;
+}
+
+#elif defined(WAIT_USE_POLL)
+
+/*
+ * Wait using poll(2).
+ *
+ * This allows receiving readiness notifications for several events at once,
+ * but requires iterating through all of set->pollfds.
+ */
+static inline int
+WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+					  WaitEvent *occurred_events, int nevents)
+{
+	int			returned_events = 0;
+	int			rc;
+	WaitEvent  *cur_event;
+	struct pollfd *cur_pollfd;
+
+	/* Sleep */
+	rc = poll(set->pollfds, set->nevents, (int) cur_timeout);
+
+	/* Check return code */
+	if (rc < 0)
+	{
+		/* EINTR is okay, otherwise complain */
+		if (errno != EINTR)
+		{
+			waiting = false;
+			ereport(ERROR,
+					(errcode_for_socket_access(),
+					 errmsg("poll() failed: %m")));
+		}
+		return 0;
+	}
+	else if (rc == 0)
+	{
+		/* timeout exceeded */
+		return -1;
+	}
+
+	for (cur_event = set->events, cur_pollfd = set->pollfds;
+		 cur_event < (set->events + set->nevents) &&
+		 returned_events < nevents;
+		 cur_event++, cur_pollfd++)
+	{
+		/* no activity on this FD, skip */
+		if (cur_pollfd->revents == 0)
+			continue;
+
+		occurred_events->pos = cur_event->pos;
+		occurred_events->events = 0;
+
+		if (cur_event->events == WL_LATCH_SET &&
+			(cur_pollfd->revents & (POLLIN | POLLHUP | POLLERR | POLLNVAL)))
+		{
+			/* There's data in the self-pipe, clear it. */
+			drainSelfPipe();
+
+			if (set->latch->is_set)
+			{
+				occurred_events->fd = PGINVALID_SOCKET;
+				occurred_events->events = WL_LATCH_SET;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+		else if (cur_event->events == WL_POSTMASTER_DEATH &&
+			 (cur_pollfd->revents & (POLLIN | POLLHUP | POLLERR | POLLNVAL)))
+		{
+			/*
+			 * We expect a POLLHUP when the remote end is closed, but because
+			 * we don't expect the pipe to become readable or to have any
+			 * errors either, treat those cases as postmaster death, too.
+			 *
+			 * As explained in the WAIT_USE_SELECT implementation, select(2)
+			 * may spuriously return. Be paranoid about that here too, a
+			 * spurious WL_POSTMASTER_DEATH would be painful.
+			 */
+			if (!PostmasterIsAlive())
+			{
+				occurred_events->fd = PGINVALID_SOCKET;
+				occurred_events->events = WL_POSTMASTER_DEATH;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+		else if (cur_event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+		{
+			Assert(cur_event->fd);
+
+			if ((cur_event->events & WL_SOCKET_READABLE) &&
+			 (cur_pollfd->revents & (POLLIN | POLLHUP | POLLERR | POLLNVAL)))
+			{
+				occurred_events->events |= WL_SOCKET_READABLE;
+			}
+
+			if ((cur_event->events & WL_SOCKET_WRITEABLE) &&
+			(cur_pollfd->revents & (POLLOUT | POLLHUP | POLLERR | POLLNVAL)))
+			{
+				occurred_events->events |= WL_SOCKET_WRITEABLE;
+			}
+
+			if (occurred_events->events != 0)
+			{
+				occurred_events->fd = cur_event->fd;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+	}
+	return returned_events;
+}
+
+#elif defined(WAIT_USE_SELECT)
+
+/*
+ * Wait using select(2).
+ *
+ * XXX: On at least older linux kernels select(), in violation of POSIX,
+ * doesn't reliably return a socket as writable if closed - but we rely on
+ * that. So far all the known cases of this problem are on platforms that also
+ * provide a poll() implementation without that bug.  If we find one where
+ * that's not the case, we'll need to add a workaround.
+ */
+static inline int
+WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+					  WaitEvent *occurred_events, int nevents)
+{
+	int			returned_events = 0;
+	int			rc;
+	WaitEvent  *cur_event;
+	fd_set		input_mask;
+	fd_set		output_mask;
+	int			hifd = -1;
+	struct timeval tv;
+	struct timeval *tvp = NULL;
+
+	FD_ZERO(&input_mask);
+	FD_ZERO(&output_mask);
+
+	/*
+	 * Prepare input/output masks. We do so every loop iteration as there's no
+	 * entirely portable way to copy fd_sets.
+	 */
+	for (cur_event = set->events;
+		 cur_event < (set->events + set->nevents);
+		 cur_event++)
+	{
+		if (cur_event->events == WL_LATCH_SET)
+			FD_SET(cur_event->fd, &input_mask);
+		else if (cur_event->events == WL_POSTMASTER_DEATH)
+			FD_SET(cur_event->fd, &input_mask);
+		else
+		{
+			Assert(cur_event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE));
+			if (cur_event->events & WL_SOCKET_READABLE)
+				FD_SET(cur_event->fd, &input_mask);
+			if (cur_event->events & WL_SOCKET_WRITEABLE)
+				FD_SET(cur_event->fd, &output_mask);
+		}
+
+		if (cur_event->fd > hifd)
+			hifd = cur_event->fd;
+	}
+
+	/* Sleep */
+	if (cur_timeout >= 0)
+	{
+		tv.tv_sec = cur_timeout / 1000L;
+		tv.tv_usec = (cur_timeout % 1000L) * 1000L;
+		tvp = &tv;
+	}
+	rc = select(hifd + 1, &input_mask, &output_mask, NULL, tvp);
+
+	/* Check return code */
+	if (rc < 0)
+	{
+		/* EINTR is okay, otherwise complain */
+		if (errno != EINTR)
+		{
+			waiting = false;
+			ereport(ERROR,
+					(errcode_for_socket_access(),
+					 errmsg("select() failed: %m")));
+		}
+		return 0;				/* retry */
+	}
+	else if (rc == 0)
+	{
+		/* timeout exceeded */
+		return -1;
+	}
+
+	/*
+	 * To associate events with select's masks, we have to check the status of
+	 * the file descriptors associated with an event; by looping through all
+	 * events.
+	 */
+	for (cur_event = set->events;
+		 cur_event < (set->events + set->nevents)
+		 && returned_events < nevents;
+		 cur_event++)
+	{
+		occurred_events->pos = cur_event->pos;
+		occurred_events->events = 0;
+
+		if (cur_event->events == WL_LATCH_SET &&
+			FD_ISSET(cur_event->fd, &input_mask))
+		{
+			/* There's data in the self-pipe, clear it. */
+			drainSelfPipe();
+
+			if (set->latch->is_set)
+			{
+				occurred_events->fd = PGINVALID_SOCKET;
+				occurred_events->events = WL_LATCH_SET;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+		else if (cur_event->events == WL_POSTMASTER_DEATH &&
+				 FD_ISSET(cur_event->fd, &input_mask))
+		{
+			/*
+			 * According to the select(2) man page on Linux, select(2) may
+			 * spuriously return and report a file descriptor as readable,
+			 * when it's not; and presumably so can poll(2).  It's not clear
+			 * that the relevant cases would ever apply to the postmaster
+			 * pipe, but since the consequences of falsely returning
+			 * WL_POSTMASTER_DEATH could be pretty unpleasant, we take the
+			 * trouble to positively verify EOF with PostmasterIsAlive().
+			 */
+			if (!PostmasterIsAlive())
+			{
+				occurred_events->fd = PGINVALID_SOCKET;
+				occurred_events->events = WL_POSTMASTER_DEATH;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+		else if (cur_event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+		{
+			Assert(cur_event->fd >= 0);
+
+			if ((cur_event->events & WL_SOCKET_READABLE) &&
+				FD_ISSET(cur_event->fd, &input_mask))
+			{
+				/* data available in socket, or EOF */
+				occurred_events->events |= WL_SOCKET_READABLE;
+			}
+
+			if ((cur_event->events & WL_SOCKET_WRITEABLE) &&
+				FD_ISSET(cur_event->fd, &output_mask))
+			{
+				/* socket is writeable, or EOF */
+				occurred_events->events |= WL_SOCKET_WRITEABLE;
+			}
+
+			if (occurred_events->events != 0)
+			{
+				occurred_events->fd = cur_event->fd;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+	}
+	return returned_events;
+}
+
+#elif defined(WAIT_USE_WIN32)
+
+/*
+ * Wait using Windows' WaitForMultipleObjects().
+ *
+ * Unfortunately this will only ever return a single readiness notification at
+ * a time.  Note that while the official documentation for
+ * WaitForMultipleObjects is ambiguous about multiple events being "consumed"
+ * with a single bWaitAll = FALSE call,
+ * https://blogs.msdn.microsoft.com/oldnewthing/20150409-00/?p=44273 confirms
+ * that only one event is "consumed".
+ */
+static inline int
+WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+					  WaitEvent *occurred_events, int nevents)
+{
+	int			returned_events = 0;
+	DWORD		rc;
+	WaitEvent  *cur_event;
+
+	/*
+	 * Sleep.
+	 *
+	 * Need to wait for ->nevents + 1, because signal handle is in [0].
+	 */
+	rc = WaitForMultipleObjects(set->nevents + 1, set->handles, FALSE,
+								cur_timeout);
+
+	/* Check return code */
+	if (rc == WAIT_FAILED)
+		elog(ERROR, "WaitForMultipleObjects() failed: error code %lu",
+			 GetLastError());
+	else if (rc == WAIT_TIMEOUT)
+	{
+		/* timeout exceeded */
+		return -1;
+	}
+
+	if (rc == WAIT_OBJECT_0)
+	{
+		/* Service newly-arrived signals */
+		pgwin32_dispatch_queued_signals();
+		return 0;				/* retry */
+	}
+
+	/*
+	 * With an offset of one, due to pgwin32_signal_event, the handle offset
+	 * directly corresponds to a wait event.
+	 */
+	cur_event = (WaitEvent *) &set->events[rc - WAIT_OBJECT_0 - 1];
+
+	occurred_events->pos = cur_event->pos;
+	occurred_events->events = 0;
+
+	if (cur_event->events == WL_LATCH_SET)
+	{
+		if (!ResetEvent(set->latch->event))
+			elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());
+
+		if (set->latch->is_set)
+		{
+			occurred_events->fd = PGINVALID_SOCKET;
+			occurred_events->events = WL_LATCH_SET;
+			occurred_events++;
+			returned_events++;
+		}
+	}
+	else if (cur_event->events == WL_POSTMASTER_DEATH)
+	{
+		/*
+		 * Postmaster apparently died.  Since the consequences of falsely
+		 * returning WL_POSTMASTER_DEATH could be pretty unpleasant, we take
+		 * the trouble to positively verify this with PostmasterIsAlive(),
+		 * even though there is no known reason to think that the event could
+		 * be falsely set on Windows.
+		 */
+		if (!PostmasterIsAlive())
+		{
+			occurred_events->fd = PGINVALID_SOCKET;
+			occurred_events->events = WL_POSTMASTER_DEATH;
+			occurred_events++;
+			returned_events++;
+		}
+	}
+	else if (cur_event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+	{
+		WSANETWORKEVENTS resEvents;
+
+		Assert(cur_event->fd);
+
+		occurred_events->fd = cur_event->fd;
+
+		ZeroMemory(&resEvents, sizeof(resEvents));
+		if (WSAEnumNetworkEvents(cur_event->fd, set->handles[cur_event->pos + 1], &resEvents) != 0)
+			elog(ERROR, "failed to enumerate network events: error code %u",
+				 WSAGetLastError());
+		if ((cur_event->events & WL_SOCKET_READABLE) &&
+			(resEvents.lNetworkEvents & FD_READ))
+		{
+			occurred_events->events |= WL_SOCKET_READABLE;
+		}
+		if ((cur_event->events & WL_SOCKET_WRITEABLE) &&
+			(resEvents.lNetworkEvents & FD_WRITE))
+		{
+			occurred_events->events |= WL_SOCKET_WRITEABLE;
+		}
+		if (resEvents.lNetworkEvents & FD_CLOSE)
+		{
+			if (cur_event->events & WL_SOCKET_READABLE)
+				occurred_events->events |= WL_SOCKET_READABLE;
+			if (cur_event->events & WL_SOCKET_WRITEABLE)
+				occurred_events->events |= WL_SOCKET_WRITEABLE;
+		}
+
+		if (occurred_events->events != 0)
+		{
+			occurred_events++;
+			returned_events++;
+		}
+	}
+
+	return returned_events;
+}
+#endif
+
+/*
  * SetLatch uses SIGUSR1 to wake up the process waiting on the latch.
  *
  * Wake up WaitLatch, if we're waiting.  (We might not be, since SIGUSR1 is
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 18f5e6f..d13355b 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -33,6 +33,7 @@
 
 #include "access/htup_details.h"
 #include "catalog/pg_authid.h"
+#include "libpq/libpq.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "postmaster/autovacuum.h"
@@ -247,6 +248,9 @@ SwitchToSharedLatch(void)
 
 	MyLatch = &MyProc->procLatch;
 
+	if (FeBeWaitSet)
+		ModifyWaitEvent(FeBeWaitSet, 1, WL_LATCH_SET, MyLatch);
+
 	/*
 	 * Set the shared latch as the local one might have been set. This
 	 * shouldn't normally be necessary as code is supposed to check the
@@ -262,6 +266,10 @@ SwitchBackToLocalLatch(void)
 	Assert(MyProc != NULL && MyLatch == &MyProc->procLatch);
 
 	MyLatch = &LocalLatchData;
+
+	if (FeBeWaitSet)
+		ModifyWaitEvent(FeBeWaitSet, 1, WL_LATCH_SET, MyLatch);
+
 	SetLatch(MyLatch);
 }
 
diff --git a/src/include/libpq/libpq.h b/src/include/libpq/libpq.h
index 0569994..109fdf7 100644
--- a/src/include/libpq/libpq.h
+++ b/src/include/libpq/libpq.h
@@ -19,6 +19,7 @@
 
 #include "lib/stringinfo.h"
 #include "libpq/libpq-be.h"
+#include "storage/latch.h"
 
 
 typedef struct
@@ -95,6 +96,8 @@ extern ssize_t secure_raw_write(Port *port, const void *ptr, size_t len);
 
 extern bool ssl_loaded_verify_locations;
 
+WaitEventSet *FeBeWaitSet;
+
 /* GUCs */
 extern char *SSLCipherSuites;
 extern char *SSLECDHCurve;
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 3813226..c72635c 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -530,6 +530,9 @@
 /* Define to 1 if you have the syslog interface. */
 #undef HAVE_SYSLOG
 
+/* Define to 1 if you have the <sys/epoll.h> header file. */
+#undef HAVE_SYS_EPOLL_H
+
 /* Define to 1 if you have the <sys/ioctl.h> header file. */
 #undef HAVE_SYS_IOCTL_H
 
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 1b9521f..0d5fb77 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -68,6 +68,12 @@
  * use of any generic handler.
  *
  *
+ * WaitEventSets allow to wait for latches being set and additional events -
+ * postmaster dying and socket readiness of several sockets currently - at the
+ * same time.  On many platforms using a long lived event set is more
+ * efficient than using WaitLatch or WaitLatchOrSocket.
+ *
+ *
  * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
@@ -95,13 +101,26 @@ typedef struct Latch
 #endif
 } Latch;
 
-/* Bitmasks for events that may wake-up WaitLatch() clients */
+/*
+ * Bitmasks for events that may wake-up WaitLatch(), WaitLatchOrSocket(), or
+ * WaitEventSetWait().
+ */
 #define WL_LATCH_SET		 (1 << 0)
 #define WL_SOCKET_READABLE	 (1 << 1)
 #define WL_SOCKET_WRITEABLE  (1 << 2)
-#define WL_TIMEOUT			 (1 << 3)
+#define WL_TIMEOUT			 (1 << 3) /* not for WaitEventSetWait() */
 #define WL_POSTMASTER_DEATH  (1 << 4)
 
+typedef struct WaitEvent
+{
+	int			pos;			/* position in the event data structure */
+	uint32		events;			/* triggered events */
+	pgsocket	fd;				/* socket fd associated with event */
+} WaitEvent;
+
+/* forward declaration to avoid exposing latch.c implementation details */
+typedef struct WaitEventSet WaitEventSet;
+
 /*
  * prototypes for functions in latch.c
  */
@@ -110,12 +129,18 @@ extern void InitLatch(volatile Latch *latch);
 extern void InitSharedLatch(volatile Latch *latch);
 extern void OwnLatch(volatile Latch *latch);
 extern void DisownLatch(volatile Latch *latch);
-extern int	WaitLatch(volatile Latch *latch, int wakeEvents, long timeout);
-extern int WaitLatchOrSocket(volatile Latch *latch, int wakeEvents,
-				  pgsocket sock, long timeout);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern void FreeWaitEventSet(WaitEventSet *set);
+extern int	AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd, Latch *latch);
+extern void ModifyWaitEvent(WaitEventSet *set, int pos, uint32 events, Latch *latch);
+
+extern int	WaitEventSetWait(WaitEventSet *set, long timeout, WaitEvent *occurred_events, int nevents);
+extern int	WaitLatch(volatile Latch *latch, int wakeEvents, long timeout);
+extern int WaitLatchOrSocket(volatile Latch *latch, int wakeEvents,
+				  pgsocket sock, long timeout);
 
 /*
  * Unix implementation uses SIGUSR1 for inter-process signaling.
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b850db0..c2511de 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2113,6 +2113,8 @@ WalSnd
 WalSndCtlData
 WalSndSendDataCallback
 WalSndState
+WaitEvent
+WaitEventSet
 WholeRowVarExprState
 WindowAgg
 WindowAggState
-- 
2.7.0.229.g701fa7f

#78Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Andres Freund (#77)
Re: Performance degradation in commit ac1d794

On Sun, Mar 20, 2016 at 5:14 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-19 18:45:36 -0700, Andres Freund wrote:

On 2016-03-19 16:44:49 +0530, Amit Kapila wrote:

On Fri, Mar 18, 2016 at 1:34 PM, Andres Freund <andres@anarazel.de> wrote:

Attached is a significantly revised version of the earlier series. Most
importantly I have:
* Unified the window/unix latch implementation into one file (0004)

After applying patch 0004* on HEAD, using command patch -p1 <
<path_of_patch>, I am getting build failure:

c1 : fatal error C1083: Cannot open source file:
'src/backend/storage/ipc/latch.c': No such file or directory

I think it could not rename port/unix_latch.c => storage/ipc/latch.c. I
have tried with git apply, but no success. Am I doing something wrong?

You skipped applying 0003.

I'll send an updated version - with all the docs and such - in the next
hours.

Here we go. I think this is getting pretty close to being committable,
minus a bit of testing edge cases on unix (postmaster death,
disconnecting clients in various ways (especially with COPY)) and
windows (uh, does it even work at all?).

There's no large code changes in this revision, mainly some code
polishing and a good bit more comment improvements.

Hi

I couldn't get the second patch to apply for some reason, but I have
been trying out your "latch" branch on some different OSs and porting
some code that does a bunch of waiting on many sockets over to this
API to try it out.

One thing that I want to do but can't with this interface is remove an
fd from the set. I can AddWaitEventToSet returning a position, and I
can ModifyWait to provide new event mask by position including zero
mask, I can't actually remove the fd (for example to avoid future
error events that can't be masked, or to allow that fd to be closed
and perhaps allow that fd number to coincidentally be readded later,
and just generally to free the slot). There is an underlying way to
remove an fd from a set with poll (sort of), epoll, kqueue. (Not sure
about Windows. But surely...). I wonder if there should be
RemoveWaitEventFromSet(set, position) which recycles event slots,
sticking them on a freelist (and setting corresponding pollfd structs'
fd to -1).
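
For example, usage could look like this (RemoveWaitEventFromSet is a
hypothetical function here, just to show the intended contract):

    pos = AddWaitEventToSet(set, WL_SOCKET_READABLE, fd, NULL);
    ...
    RemoveWaitEventFromSet(set, pos);   /* slot goes onto a freelist */
    close(fd);                          /* now safe to close */
    /* a later AddWaitEventToSet() may recycle the slot, even for the
     * same fd number */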

I wonder if it would be useful to reallocate the arrays as needed so
there isn't really a hard limit to the number of things you add, just
an initial size.

A couple of typos:

+ * Wait using linux' epoll_wait(2).

linux's

+       /*
+        * If neither the event mask nor the associated latch changes, return
+        * early. That's an important optimization for some sockets, were
+        * ModifyWaitEvent is frequently used to switch from waiting
for reads to
+        * waiting on writes.
+        */

s/were/where/

+       /*
+        * Even though unused, we also poss epoll_ev as the data argument if
+        * EPOLL_CTL_DELETE is passed as action.  There used to be an epoll bug
+        * requiring that, and acutally it makes the code simpler...
+        */

s/poss/pass/
s/EPOLL_CTL_DELETE/EPOLL_CTL_DEL/
s/acutally/actually/

There is no code passing EPOLL_CTL_DEL, but maybe this comment is a
clue that you have already implemented remove on some other branch...

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#79Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#78)
Re: Performance degradation in commit ac1d794

On 2016-03-21 01:31:30 +1300, Thomas Munro wrote:

I couldn't get the second patch to apply for some reason,

Weird? Even after applying the first one first?

but I have been trying out your "latch" branch on some different OSs
and porting some code that does a bunch of waiting on many sockets
over to this API to try it out.

One thing that I want to do but can't with this interface is remove an
fd from the set.

Yea, I didn't see an in-core need for that, so I didn't want to add
that yet.

Other than that, how did things go?

I can AddWaitEventToSet returning a position, and I
can ModifyWait to provide new event mask by position including zero
mask, I can't actually remove the fd (for example to avoid future
error events that can't be masked, or to allow that fd to be closed
and perhaps allow that fd number to coincidentally be readded later,
and just generally to free the slot). There is an underlying way to
remove an fd from a set with poll (sort of), epoll, kqueue. (Not sure
about Windows. But surely...). I wonder if there should be
RemoveWaitEventFromSet(set, position) which recycles event slots,
sticking them on a freelist (and setting corresponding pollfd structs'
fd to -1).

I'm inclined to think that if we add this, we memmove everything after the
removed event and adjust the offsets. As we e.g. pass the entire pollfd
array to poll() at once, not having gaps will be more important.
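
For the poll case that'd be roughly the following (rough, untested
sketch; the epoll backend would additionally have to re-register the
data pointers of the moved events, and callers' saved positions would
of course shift):

    void
    RemoveWaitEventFromSet(WaitEventSet *set, int pos)
    {
        int         nmove = set->nevents - pos - 1;
        int         i;

        Assert(pos >= 0 && pos < set->nevents);

        /* keep the arrays dense, since poll() gets the whole array */
        memmove(&set->events[pos], &set->events[pos + 1],
                nmove * sizeof(WaitEvent));
        memmove(&set->pollfds[pos], &set->pollfds[pos + 1],
                nmove * sizeof(struct pollfd));
        set->nevents--;

        /* the moved events now live one slot earlier */
        for (i = pos; i < set->nevents; i++)
            set->events[i].pos = i;
    }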

What's the use case where you need that?

I wonder if it would be useful to reallocate the arrays as needed so
there isn't really a hard limit to the number of things you add, just
an initial size.

Doubles the number of palloc()s required. As lots of these sets are
actually going to be very short-lived (a single WaitLatch call), that's
something I'd rather avoid. Although I guess you could just allocate
separate arrays at that moment.

A couple of typos:

+ * Wait using linux' epoll_wait(2).

linux's

+       /*
+        * If neither the event mask nor the associated latch changes, return
+        * early. That's an important optimization for some sockets, were
+        * ModifyWaitEvent is frequently used to switch from waiting
for reads to
+        * waiting on writes.
+        */

s/were/where/

+       /*
+        * Even though unused, we also poss epoll_ev as the data argument if
+        * EPOLL_CTL_DELETE is passed as action.  There used to be an epoll bug
+        * requiring that, and acutally it makes the code simpler...
+        */

s/poss/pass/
s/EPOLL_CTL_DELETE/EPOLL_CTL_DEL/
s/acutally/actually/

Thanks!

There is no code passing EPOLL_CTL_DEL, but maybe this comment is a
clue that you have already implemented remove on some other branch...

No, I've not, sorry...

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#80Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Andres Freund (#79)
Re: Performance degradation in commit ac1d794

On Mon, Mar 21, 2016 at 3:46 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-21 01:31:30 +1300, Thomas Munro wrote:

I couldn't get the second patch to apply for some reason,

Weird? Even after applying the first one first?

Ah, I was using patch -p1. I needed to use git am, which understands
how to rename stuff.

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#81Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Andres Freund (#79)
1 attachment(s)
Re: Performance degradation in commit ac1d794

On Mon, Mar 21, 2016 at 3:46 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-21 01:31:30 +1300, Thomas Munro wrote:

I couldn't get the second patch to apply for some reason,

Weird? Even after applying the first one first?

but I have been trying out your "latch" branch on some different OSs
and porting some code that does a bunch of waiting on many sockets
over to this API to try it out.

One thing that I want to do but can't with this interface is remove an
fd from the set.

Yea, I didn't see an in-core need for that, so I didn't want to add
that yet.

Other than that, how did things go?

So far, so good. No surprises.

I can AddWaitEventToSet returning a position, and I
can ModifyWait to provide new event mask by position including zero
mask, I can't actually remove the fd (for example to avoid future
error events that can't be masked, or to allow that fd to be closed
and perhaps allow that fd number to coincidentally be readded later,
and just generally to free the slot). There is an underlying way to
remove an fd from a set with poll (sort of), epoll, kqueue. (Not sure
about Windows. But surely...). I wonder if there should be
RemoveWaitEventFromSet(set, position) which recycles event slots,
sticking them on a freelist (and setting corresponding pollfd structs'
fd to -1).

I'm inclined to think that if we add this, we memmove everything after the
removed event and adjust the offsets. As we e.g. pass the entire pollfd
array to poll() at once, not having gaps will be more important.

What's the use case where you need that?

I was experimenting with a change to Append so that it could deal with
asynchronous subplans, and in particular support for asynchronous
foreign scans. See the attached patch, which applies on top of your latch
branch (or your two patches above, I assume) and does that. It is
not a very ambitious form of asynchrony and there are opportunities to
do so much more in this area (about which more later in some future
thread), but it is some relevant educational code I had to hand, so I
ported it to your new API as a way to try the API out.

The contract that I invented here is that an async-aware parent node
can ask any child node "are you ready?" and get back various answers,
including an fd which means: please don't call ExecProcNode until this
fd is ready to read. But after ExecProcNode is called, the fd must
not be accessed again (the subplan has the right to close it, return a
different one next time etc), so it must not appear in any
WaitEventSet waited on after that. As you can see, in
append_next_async_wait, it therefore needs to create a new
WaitEventSet every time it needs to wait, which makes it feel more
like select() than epoll(). Ideally it'd have just one
WaitEventSet for the lifetime of the append node, and just add and
remove fds as required.
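
To make that concrete, the per-wait construction ends up looking
roughly like this (a simplified sketch, not the exact code in the
attachment; fds is the array of descriptors the ready-checks handed
back):

    static pgsocket
    append_next_async_wait(pgsocket *fds, int nfds)
    {
        WaitEventSet *set;
        WaitEvent   occurred;
        pgsocket    ready = PGINVALID_SOCKET;
        int         i;

        /* throwaway set: our latch plus every fd a subplan gave us */
        set = CreateWaitEventSet(CurrentMemoryContext, nfds + 1);
        AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch);
        for (i = 0; i < nfds; i++)
            AddWaitEventToSet(set, WL_SOCKET_READABLE, fds[i], NULL);

        /* block until the latch or one of the sockets is ready */
        WaitEventSetWait(set, -1, &occurred, 1);
        if (occurred.events & WL_SOCKET_READABLE)
            ready = occurred.fd;

        /*
         * The set can't outlive this wait: once ExecProcNode runs, a
         * subplan may close or replace its fd, so we rebuild next time.
         */
        FreeWaitEventSet(set);

        return ready;
    }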

I wonder if it would be useful to reallocate the arrays as needed so
there isn't really a hard limit to the number of things you add, just
an initial size.

Doubles the number of palloc()s required. As lots of these sets are
actually going to be very short-lived (a single WaitLatch call), that's
something I'd rather avoid. Although I guess you could just allocate
separate arrays at that moment.

On the subject of short-lived and frequent calls to WaitLatchOrSocket,
I wonder if there would be some benefit in reusing a statically
allocated WaitEventSet for that. That would become possible if you
could add and remove the latch and socket as discussed, with an
opportunity to skip the modification work completely if the reusable
WaitEventSet already happens to have the right stuff in it from the
last WaitLatchOrSocket call. Or maybe the hot wait loops should
simply be rewritten to reuse a WaitEventSet explicitly so they can
manage that...
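
In other words, a hot loop could build the set once up front and then
just wait, something like this (sketch only; sock and want_write stand
in for whatever the caller tracks):

    WaitEventSet *set;
    int         sock_pos;

    set = CreateWaitEventSet(CurrentMemoryContext, 3);
    AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch);
    AddWaitEventToSet(set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL);
    sock_pos = AddWaitEventToSet(set, WL_SOCKET_READABLE, sock, NULL);

    for (;;)
    {
        WaitEvent   event;

        /* flip read/write interest without rebuilding the set */
        ModifyWaitEvent(set, sock_pos,
                        want_write ? WL_SOCKET_WRITEABLE : WL_SOCKET_READABLE,
                        NULL);

        WaitEventSetWait(set, -1, &event, 1);

        if (event.events & WL_LATCH_SET)
            ResetLatch(MyLatch);
        if (event.events & WL_POSTMASTER_DEATH)
            proc_exit(1);

        /* ... do the actual I/O here ... */
    }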

Some other assorted thoughts:

* It looks like epoll (and maybe kqueue?) can associate user data with
an event and give it back to you. If you have a mapping between fds
and some other object (in the case of the attached patch, subplan
nodes that are ready to be pulled), would it be possible and useful to
expose that (and simulate it where needed), rather than the caller
having to maintain associative data structures (see the linear search
in my patch, and the sketch after this list)?

* I would be interested in writing a kqueue implementation of this for
*BSD (and MacOSX?) at some point if someone doesn't beat me to it.

* It would be very cool to get some view into which WaitEventSetWait
call backends are waiting in from a system view if it could be done
cheaply enough. A name/ID for the call site, and an indication of
which latch and how many fds... or something.
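
On the first point, the sort of thing I'm imagining is roughly this
(the extra argument and the user_data member are hypothetical, not in
your patch):

    /* caller attaches a pointer when registering the event ... */
    AddWaitEventToSet(set, WL_SOCKET_READABLE, fd, NULL, subplan_state);

    /* ... and gets it back from WaitEventSetWait, no lookup needed */
    WaitEventSetWait(set, -1, &event, 1);
    ready_subplan = (PlanState *) event.user_data;

epoll could stash the pointer in epoll_event.data.ptr directly; poll
and select would keep it in the WaitEvent array and copy it out.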

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

async-append-wait-set-hack.patch (application/octet-stream)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 189f290..d21d187 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -47,6 +47,7 @@ typedef struct ConnCacheEntry
 								 * one level of subxact open, etc */
 	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
 	bool		have_error;		/* have any subxacts aborted in this xact? */
+	PgFdwConnState state;		/* extra per-connection state */
 } ConnCacheEntry;
 
 /*
@@ -92,7 +93,7 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  * mid-transaction anyway.
  */
 PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt, PgFdwConnState **state)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -137,6 +138,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
+		memset(&entry->state, 0, sizeof(entry->state));
 	}
 
 	/*
@@ -171,6 +173,10 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
 
+	/* If caller needs access to the per-connection state, return it. */
+	if (state)
+		*state = &entry->state;
+
 	return entry->conn;
 }
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index d6db834..61e91bd 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -32,6 +32,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "utils/asynchrony.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -157,6 +158,9 @@ typedef struct PgFdwScanState
 	MemoryContext temp_cxt;		/* context for per-tuple temporary data */
 
 	int			fetch_size;		/* number of tuples per fetch */
+
+	/* per-connection state */
+	PgFdwConnState *conn_state;
 } PgFdwScanState;
 
 /*
@@ -346,6 +350,8 @@ static void postgresGetForeignJoinPaths(PlannerInfo *root,
 static bool postgresRecheckForeignScan(ForeignScanState *node,
 						   TupleTableSlot *slot);
 
+static int postgresReady(ForeignScanState *node);
+
 /*
  * Helper functions
  */
@@ -365,6 +371,7 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
+static void fetch_more_data_begin(ForeignScanState *node);
 static void fetch_more_data(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
@@ -457,6 +464,9 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for join push-down */
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
+	/* Support for asynchrony */
+	routine->Ready = postgresReady;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -1288,7 +1298,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->conn = GetConnection(user, false, &fsstate->conn_state);
 
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1337,6 +1347,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 							 &fsstate->param_flinfo,
 							 &fsstate->param_exprs,
 							 &fsstate->param_values);
+	fsstate->conn_state->async_query_sent = false;
 }
 
 /*
@@ -1663,7 +1674,7 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->conn = GetConnection(user, true, NULL);
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -2238,7 +2249,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->conn = GetConnection(user, false, NULL);
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;		/* -1 means not set yet */
@@ -2500,7 +2511,7 @@ estimate_path_cost_size(PlannerInfo *root,
 								NULL);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->user, false);
+		conn = GetConnection(fpinfo->user, false, NULL);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2864,13 +2875,28 @@ fetch_more_data(ForeignScanState *node)
 		int			numrows;
 		int			i;
 
-		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
-				 fsstate->fetch_size, fsstate->cursor_number);
+		if (!fsstate->conn_state->async_query_sent)
+		{
+			/* This is a regular synchronous fetch. */
+			snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+					 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = PQexec(conn, sql);
-		/* On error, report the original query, not the FETCH. */
-		if (PQresultStatus(res) != PGRES_TUPLES_OK)
-			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+			res = PQexec(conn, sql);
+			/* On error, report the original query, not the FETCH. */
+			if (PQresultStatus(res) != PGRES_TUPLES_OK)
+				pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+		}
+		else
+		{
+			/*
+			 * The query was already sent by an earlier call to
+			 * fetch_more_data_begin.  So now we just fetch the result.
+			 */
+			res = PQgetResult(conn);
+			/* On error, report the original query, not the FETCH. */
+			if (PQresultStatus(res) != PGRES_TUPLES_OK)
+				pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+		}
 
 		/* Convert the data into HeapTuples */
 		numrows = PQntuples(res);
@@ -2899,6 +2925,15 @@ fetch_more_data(ForeignScanState *node)
 		fsstate->eof_reached = (numrows < fsstate->fetch_size);
 
 		PQclear(res);
+
+		/* If this was the second part of an async request, we must fetch until NULL. */
+		if (fsstate->conn_state->async_query_sent)
+		{
+			/* call once and raise error if not NULL as expected? */
+			while (PQgetResult(conn) != NULL)
+				;
+			fsstate->conn_state->async_query_sent = false;
+		}
 		res = NULL;
 	}
 	PG_CATCH();
@@ -2913,6 +2948,35 @@ fetch_more_data(ForeignScanState *node)
 }
 
 /*
+ * Begin an asynchronous data fetch.
+ * fetch_more_data must be called to fetch the results..
+ */
+static void
+fetch_more_data_begin(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PGconn	   *conn = fsstate->conn;
+	char		sql[64];
+
+	Assert(!fsstate->conn_state->async_query_sent);
+
+	/*
+	 * Create the cursor synchronously.  (With more state machine stuff we
+	 * could do this asynchronously too).
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	/* We will send this query, but not wait for the response. */
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (PQsendQuery(conn, sql) < 0)
+		pgfdw_report_error(ERROR, NULL, conn, false, fsstate->query);
+	fsstate->conn_state->async_query_sent = true;
+}
+
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3342,7 +3406,7 @@ postgresAnalyzeForeignTable(Relation relation,
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, NULL);
 
 	/*
 	 * Construct command to get page count for relation.
@@ -3434,7 +3498,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, NULL);
 
 	/*
 	 * Construct cursor that retrieves whole rows from remote.
@@ -3657,7 +3721,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 	 */
 	server = GetForeignServer(serverOid);
 	mapping = GetUserMapping(GetUserId(), server->serverid);
-	conn = GetConnection(mapping, false);
+	conn = GetConnection(mapping, false, NULL);
 
 	/* Don't attempt to import collation if remote server hasn't got it */
 	if (PQserverVersion(conn) < 90100)
@@ -4264,6 +4328,41 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
+static int
+postgresReady(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+
+	if (fsstate->conn_state->async_query_sent)
+	{
+		/*
+		 * We have already started a query, for some other executor node.  We
+		 * currently can't handle two at the same time (we'd have to create
+		 * more connections for that).
+		 */
+		return ASYNC_READY_BUSY;
+	}
+	else if (fsstate->next_tuple < fsstate->num_tuples)
+	{
+		/* We already have buffered tuples. */
+		return ASYNC_READY_MORE;
+	}
+	else if (fsstate->eof_reached)
+	{
+		/* We have already hit the end of the scan. */
+		return ASYNC_READY_EOF;
+	}
+	else
+	{
+		/*
+		 * We will start a query now, and tell the caller to wait until the
+		 * file descriptor says we're ready and then call ExecProcNode.
+		 */
+		fetch_more_data_begin(node);
+		return PQsocket(fsstate->conn);
+	}
+}
+
 /*
  * Create a tuple from the specified row of the PGresult.
  *
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 3a11d99..25bd426 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -21,6 +21,15 @@
 #include "libpq-fe.h"
 
 /*
+ * Extra control information relating to a connection.
+ */
+typedef struct PgFdwConnState
+{
+	/* Has an asynchronous query been sent? */
+	bool async_query_sent;
+} PgFdwConnState;
+
+/*
  * FDW-specific planner information kept in RelOptInfo.fdw_private for a
  * foreign table.  This information is collected by postgresGetForeignRelSize.
  */
@@ -99,7 +108,8 @@ extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+							 PgFdwConnState **state);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index a31dbc9..9565f35 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -116,6 +116,7 @@
 #include "executor/nodeWorktablescan.h"
 #include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
+#include "utils/asynchrony.h"
 
 
 /* ------------------------------------------------------------------------
@@ -786,6 +787,30 @@ ExecEndNode(PlanState *node)
 }
 
 /*
+ * ExecReady
+ *
+ * Check whether the node would be able to produce a new tuple without
+ * blocking.  ASYNC_READY_MORE means a tuple can be returned by ExecProcNode
+ * immediately without waiting.  ASYNC_READY_EOF means there are no further
+ * tuples to consume.  ASYNC_READY_UNSUPPORTED means that this node doesn't
+ * support asynchronous interaction.  ASYNC_READY_BUSY means that this node
+ * currently can't provide asynchronous service.  Any other value is a file
+ * descriptor which can be used to wait until the node is ready to produce a
+ * tuple.
+ */
+int
+ExecReady(PlanState *node)
+{
+	switch (nodeTag(node))
+	{
+		case T_ForeignScanState:
+			return ExecForeignScanReady((ForeignScanState *) node);
+		default:
+			return ASYNC_READY_UNSUPPORTED;
+	}
+}
+
+/*
  * ExecShutdownNode
  *
  * Give execution nodes a chance to stop asynchronous resource consumption
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a26bd63..7501483 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -59,6 +59,8 @@
 
 #include "executor/execdebug.h"
 #include "executor/nodeAppend.h"
+#include "storage/latch.h"
+#include "utils/asynchrony.h"
 
 static bool exec_append_initialize_next(AppendState *appendstate);
 
@@ -181,9 +183,207 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->as_whichplan = 0;
 	exec_append_initialize_next(appendstate);
 
+	/*
+	 * Initially we consider all subplans to be potentially asynchronous.
+	 */
+	appendstate->asyncplans = (PlanState **) palloc(nplans * sizeof(PlanState *));
+	appendstate->asyncfds = (int *) palloc0(nplans * sizeof(int));
+	appendstate->nasyncplans = nplans;
+	memcpy(appendstate->asyncplans, appendstate->appendplans, nplans * sizeof(PlanState *));
+	appendstate->lastreadyplan = 0;
+
 	return appendstate;
 }
 
+/*
+ * Forget about an asynchronous subplan, given an async subplan index.  Return
+ * the index of the next subplan.
+ */
+static int
+forget_async_subplan(AppendState *node, int i)
+{
+	int last = node->nasyncplans - 1;
+
+	if (i == last)
+	{
+		/* This was the last subplan, forget it and move to first. */
+		i = 0;
+		if (node->lastreadyplan == last)
+			node->lastreadyplan = 0;
+	}
+	else
+	{
+		/*
+		 * Move the last one here (cheaper than memmov'ing the whole array
+		 * down and we don't care about the order).
+		 */
+		node->asyncplans[i] = node->asyncplans[last];
+		node->asyncfds[i] = node->asyncfds[last];
+	}
+	--node->nasyncplans;
+
+	return i;
+}
+
+/*
+ * Wait for the first asynchronous subplan's file descriptor to be ready to
+ * read or error, and then ask it for a tuple.
+ *
+ * This is called by append_next_async when every async subplan has provided a
+ * file descriptor to wait on, so we must begin waiting.
+ */
+static TupleTableSlot *
+append_next_async_wait(AppendState *node)
+{
+	while (node->nasyncplans > 0)
+	{
+		WaitEventSet *set;
+		WaitEvent event;
+		int i;
+
+		/*
+		 * For now there is no facility to remove fds from WaitEventSets when
+		 * they are no longer interesting, so we allocate, populate, free
+		 * every time, a la select().  If we had RemoveWaitEventFromSet, we
+		 * could use the same WaitEventSet object for the life of the append
+		 * node, and add/remove as we go, a la epoll/kqueue.
+		 *
+		 * Note: We could make a single call to WaitEventSetWait and have a
+		 * big enough output event buffer to learn about readiness on all
+		 * interesting sockets and loop over those, but one implementation can
+		 * only tell us about a single socket at a time, so we need to be
+		 * prepared to call WaitEventSetWait repeatedly.
+		 */
+		set = CreateWaitEventSet(CurrentMemoryContext, node->nasyncplans + 1);
+		AddWaitEventToSet(set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL);
+		for (i = 0; i < node->nasyncplans; ++i)
+		{
+			Assert(node->asyncfds[i] > 0);
+			AddWaitEventToSet(set, WL_SOCKET_READABLE, node->asyncfds[i], NULL);
+		}
+		i = WaitEventSetWait(set, -1, &event, 1);
+		Assert(i > 0);
+		FreeWaitEventSet(set);
+
+		if (event.events & WL_POSTMASTER_DEATH)
+			exit(0);
+		if (event.events & WL_SOCKET_READABLE)
+		{
+			/* Linear search for the node that told us to wait for this fd. */
+			for (i = 0; i < node->nasyncplans; ++i)
+			{
+				if (event.fd == node->asyncfds[i])
+				{
+					TupleTableSlot *result;
+
+					/*
+					 * We assume that because the fd is ready, it can produce
+					 * a tuple now, which is not perfect.  An improvement
+					 * would be if it could say 'not yet, I'm still not
+					 * ready', so eg postgres_fdw could PQconsumeInput and
+					 * then say 'I need more input'.
+					 */
+					result = ExecProcNode(node->asyncplans[i]);
+					if (!TupIsNull(result))
+					{
+						/*
+						 * Remember this plan so that append_next_async will
+						 * keep trying this subplan first until it stops
+						 * feeding us buffered tuples.
+						 */
+						node->lastreadyplan = i;
+						/* We can stop waiting for this fd. */
+						node->asyncfds[i] = 0;
+						return result;
+					}
+					else
+					{
+						/*
+						 * This subplan has reached EOF.  We'll go back and
+						 * wait for another one.
+						 */
+						forget_async_subplan(node, i);
+						break;
+					}
+				}
+			}
+		}
+	}
+	/*
+	 * We visited every ready subplan, tried to pull a tuple, and they all
+	 * reported EOF.  There is no more async data available.
+	 */
+	return NULL;
+}
+
+/*
+ * Fetch the next tuple available from any asynchronous subplan.  If none can
+ * provide a tuple immediately, wait for the first one that is ready to
+ * provide a tuple.  Return NULL when there are no more tuples available.
+ */
+static TupleTableSlot *
+append_next_async(AppendState *node)
+{
+	int count;
+	int i;
+
+	/*
+	 * We'll start our scan of subplans at the last one that was able to give
+	 * us a tuple, if there was one.  It may be able to give us a new tuple
+	 * straight away so we can leave early.
+	 */
+	i = node->lastreadyplan;
+
+	/* Loop until we've visited each potentially async subplan. */
+	for (count = node->nasyncplans; count > 0; --count)
+	{
+		/*
+		 * If we don't already have a file descriptor to wait on for this
+		 * subplan, see if it is ready.
+		 */
+		if (node->asyncfds[i] == 0)
+		{
+			int ready = ExecReady(node->asyncplans[i]);
+
+			switch (ready)
+			{
+			case ASYNC_READY_MORE:
+				/* The node has a buffered tuple for us. */
+				return ExecProcNode(node->asyncplans[i]);
+
+			case ASYNC_READY_UNSUPPORTED:
+			case ASYNC_READY_EOF:
+			case ASYNC_READY_BUSY:
+				/* This subplan can't give us anything asynchronously. */
+				i = forget_async_subplan(node, i);
+				continue;
+
+			default:
+				/* We have a new file descriptor to wait for. */
+				Assert(ready > 0);
+				node->asyncfds[i] = ready;
+				node->lastreadyplan = 0;
+				break;
+			}
+		}
+
+		/* Move on to the next plan (circular). */
+		i = (i + 1) % node->nasyncplans;
+	}
+
+	/* We might have removed all subplans; if so we can leave now. */
+	if (node->nasyncplans == 0)
+		return NULL;
+
+	/*
+	 * If we reached here, then all remaining async subplans have given us a
+	 * file descriptor to wait for.  So do that, and pull a tuple as soon as
+	 * one is ready.
+	 */
+	return append_next_async_wait(node);
+}
+
+
 /* ----------------------------------------------------------------
  *	   ExecAppend
  *
@@ -193,6 +393,17 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
+	/* First, drain all asynchronous subplans as they become ready. */
+	if (node->nasyncplans > 0)
+	{
+		TupleTableSlot *result = append_next_async(node);
+
+		if (!TupIsNull(result))
+			return result;
+	}
+	Assert(node->nasyncplans == 0);
+
+	/* Next process regular synchronous nodes sequentially. */
 	for (;;)
 	{
 		PlanState  *subnode;
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 300f947..70796d1 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -25,6 +25,7 @@
 #include "executor/executor.h"
 #include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
+#include "utils/asynchrony.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -355,3 +356,14 @@ ExecForeignScanInitializeWorker(ForeignScanState *node, shm_toc *toc)
 		fdwroutine->InitializeWorkerForeignScan(node, toc, coordinate);
 	}
 }
+
+int
+ExecForeignScanReady(ForeignScanState *node)
+{
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	if (fdwroutine->Ready)
+		return fdwroutine->Ready(node);
+	else
+		return ASYNC_READY_UNSUPPORTED;
+}
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 44fac27..e364a8d 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -224,6 +224,7 @@ extern void EvalPlanQualEnd(EPQState *epqstate);
 extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags);
 extern TupleTableSlot *ExecProcNode(PlanState *node);
 extern Node *MultiExecProcNode(PlanState *node);
+extern int ExecReady(PlanState *node);
 extern void ExecEndNode(PlanState *node);
 extern bool ExecShutdownNode(PlanState *node);
 
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index c255329..b1f3168 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -29,4 +29,6 @@ extern void ExecForeignScanInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 											shm_toc *toc);
 
+extern int ExecForeignScanReady(ForeignScanState *node);
+
 #endif   /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 096a9c4..06aada7 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -153,6 +153,8 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 															 RelOptInfo *rel,
 														 RangeTblEntry *rte);
 
+typedef int (*Ready_function) (ForeignScanState *node);
+
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
@@ -222,6 +224,9 @@ typedef struct FdwRoutine
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous processing */
+	Ready_function Ready;
 } FdwRoutine;
 
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0113e5c..7d2881a 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1151,6 +1151,11 @@ typedef struct AppendState
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
 	int			as_nplans;
 	int			as_whichplan;
+
+	PlanState **asyncplans;
+	int		   *asyncfds;
+	int			nasyncplans;
+	int			lastreadyplan;
 } AppendState;
 
 /* ----------------
diff --git a/src/include/utils/asynchrony.h b/src/include/utils/asynchrony.h
new file mode 100644
index 0000000..c3165e9
--- /dev/null
+++ b/src/include/utils/asynchrony.h
@@ -0,0 +1,36 @@
+/*-------------------------------------------------------------------------
+ *
+ * asynchrony.h
+ *		  Asynchrony-related types and interfaces.
+ *
+ * Portions Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		  src/include/utils/asynchrony.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef ASYNCHRONY_H
+#define ASYNCHRONY_H
+
+/*
+ * Special values used by the FDW interface and the executor, for dealing with
+ * asynchronous tuple iteration.
+ */
+
+/*
+ * Asynchronous processing is not currently available (because an asynchronous
+ * request is already in progress).
+ */
+#define ASYNC_READY_BUSY -3
+
+/* There are no more tuples. */
+#define ASYNC_READY_EOF -2
+
+/* This FDW or executor node does not support asynchronous processing. */
+#define ASYNC_READY_UNSUPPORTED -1
+
+/* More tuples are available immediately without waiting. */
+#define ASYNC_READY_MORE 0
+
+#endif
#82Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#75)
Re: Performance degradation in commit ac1d794

On Sun, Mar 20, 2016 at 7:13 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-19 15:43:27 +0530, Amit Kapila wrote:

On Sat, Mar 19, 2016 at 12:40 PM, Andres Freund <andres@anarazel.de> wrote:

On March 18, 2016 11:52:08 PM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:

Won't the new code need to ensure that ResetEvent(latchevent) gets
called in case WaitForMultipleObjects() comes out when both
pgwin32_signal_event and latchevent are signalled at the same time?

WaitForMultiple only reports the readiness of one event at a time, no?

I don't think so, please read link [1] with a focus on the paragraph below,
which states how it reports the readiness or signaled state when multiple
objects become signaled.

"When *bWaitAll* is *FALSE*, this function checks the handles in the array
in order starting with index 0, until one of the objects is signaled. If
multiple objects become signaled, the function returns the index of the
first handle in the array whose object was signaled."

I think this is just incredibly bad documentation. See
https://blogs.msdn.microsoft.com/oldnewthing/20150409-00/?p=44273
(Raymond Chen can be considered an authority here imo).

The article you pointed to justifies that the way the patch does ResetEvent
is correct. I am not sure, but you can weigh whether there is a need for a
comment, so that if we want to enhance this part of the code (or write
something similar) in future, we don't need to rediscover this fact.
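
To spell that out in code, here is a stripped-down sketch of the wait loop
(illustrative only; socket, postmaster-death and timeout handling are
elided). Because latchevent is a manual-reset event, a latch signalled at
the same moment as pgwin32_signal_event is simply reported on a later
iteration, where ResetEvent() is then called:

    /* Stripped-down sketch of the Windows wait loop (error paths elided). */
    for (;;)
    {
        DWORD       rc;

        if (latch->is_set)
            break;              /* re-checked before every wait */

        rc = WaitForMultipleObjects(numevents, events, FALSE, cur_timeout);

        if (rc == WAIT_OBJECT_0)
        {
            /*
             * pgwin32_signal_event fired.  A simultaneously signalled latch
             * event stays signalled (manual-reset), so it is not lost.
             */
            pgwin32_dispatch_queued_signals();
        }
        else if (rc == WAIT_OBJECT_0 + 1)
        {
            /* latch event fired: reset it, then loop and re-check is_set */
            if (!ResetEvent(latchevent))
                elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());
        }
        /* ... socket, postmaster death and timeout handling elided ... */
    }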

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#83Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#82)
Re: Performance degradation in commit ac1d794

On March 21, 2016 5:12:38 AM GMT+01:00, Amit Kapila <amit.kapila16@gmail.com> wrote:

The article pointed by you justifies that the way ResetEvent is done by
patch is correct. I am not sure, but you can weigh, if there is a need of
comment so that if we want enhance this part of code (or want to write
something similar) in future, we don't need to rediscover this fact.

I've added a reference in a comment.

Did you have a chance of running the patched versions on windows?

I plan to push this sometime today, so I can get on to some performance patches I was planning to look into committing.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#84Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#83)
Re: Performance degradation in commit ac1d794

On Mon, Mar 21, 2016 at 10:21 AM, Andres Freund <andres@anarazel.de> wrote:

On March 21, 2016 5:12:38 AM GMT+01:00, Amit Kapila <
amit.kapila16@gmail.com> wrote:

The article pointed by you justifies that the way ResetEvent is done by
patch is correct. I am not sure, but you can weigh, if there is a need of
comment so that if we want enhance this part of code (or want to write
something similar) in future, we don't need to rediscover this fact.

I've added a reference in a comment.

Did you have a chance of running the patched versions on windows?

I am planning to do it in the next few hours.

I plan to push this sometime today, so I can get on to some performance
patches I was planning to look into committing.

Have we done testing to ensure that it actually mitigates the impact of the
performance degradation due to commit ac1d794? I wanted to do that, but
unfortunately the high-end machine on which this problem is reproducible is
down.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#85Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#81)
Re: Performance degradation in commit ac1d794

Hi,

On 2016-03-21 11:52:43 +1300, Thomas Munro wrote:

The contract that I invented here is that an async-aware parent node
can ask any child node "are you ready?" and get back various answers
including an fd which means please don't call ExecProcNode until this
fd is ready to read. But after ExecProcNode is called, the fd must
not be accessed again (the subplan has the right to close it, return a
different one next time etc), so it must not appear in any
WaitEventSet waited on after that.

Hm, why did you choose to go with that contract? Why do children need to
switch fds and such?

As you can see, in append_next_async_wait, it therefore needs to
create a new WaitEventSet every time it needs to wait, which makes it
feel more like select() than epoll(). Ideally it'd have just one
single WaitEventSet for the lifetime of the append node, and just add
and remove fds as required.

Yes, I see. I think that'd be less efficient than not manipulating the
set all the time, but still a lot more efficient than adding/removing
the postmaster death latch every round.

Anyway, I want to commit this ASAP, so I can get on to some patches in
the CF. But I'd welcome you playing with adding that ability.

On the subject of short-lived and frequent calls to WaitLatchOrSocket,
I wonder if there would be some benefit in reusing a statically
allocated WaitEventSet for that.

Wondered the same. It looks like there'd need to be a number of them (just
latch, latch + postmaster death) for it to be beneficial. Most latch waits
aren't that frequent though, so I've decided not to go there initially.

That would become possible if you
could add and remove the latch and socket as discussed, with an
opportunity to skip the modification work completely if the reusable
WaitEventSet already happens to have the right stuff in it from the
last WaitLatchOrSocket call. Or maybe the hot wait loops should
simply be rewritten to reuse a WaitEventSet explicitly so they can
manage that...

Yea, I think that's better.

* It looks like epoll (and maybe kqueue?) can associate user data with
an event and give it back to you; if you have a mapping between fds
and some other object (in the case of the attached patch, subplan
nodes that are ready to be pulled), would it be possible and useful to
expose that (and simulate it where needed) rather than the caller
having to maintain associative data structures (see the linear search
in my patch)?

We already use the pointer that epoll gives you. But note that the
'WaitEvent' struct already contains the original position the event was
registered at. That's filled in when you call WaitEventSetWait(). So
you could either just build an array based on ->pos, or we could also
add a void *private; or such to the definition.
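
Concretely, something along these lines in append_next_async_wait would
drop the linear search by using the position reported back in the WaitEvent
(a sketch only; it assumes the subplans' fds are registered in order right
after the postmaster-death event):

    AddWaitEventToSet(set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL);
    for (i = 0; i < node->nasyncplans; ++i)
        AddWaitEventToSet(set, WL_SOCKET_READABLE, node->asyncfds[i], NULL);

    if (WaitEventSetWait(set, -1, &event, 1) > 0 &&
        (event.events & WL_SOCKET_READABLE))
    {
        /* position 0 is the postmaster-death event, so subplan i is at i + 1 */
        PlanState  *subplan = node->asyncplans[event.pos - 1];

        return ExecProcNode(subplan);   /* no scan over asyncfds needed */
    }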

* I would be interested in writing a kqueue implementation of this for
*BSD (and MacOSX?) at some point if someone doesn't beat me to it.

I hoped that somebody would do that - that'd afaics be the only major
API missing.

* It would be very cool to get some view into which WaitEventSetWait
call backends are waiting in from a system view if it could be done
cheaply enough. A name/ID for the call site, and an indication of
which latch and how many fds... or something.

That's not something I plan to tackle today; we don't have wait event
integration for latches atm. There are some threads somewhere about this,
but that seems mostly independent of the way latches are implemented.

Thanks,

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#86Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#84)
Re: Performance degradation in commit ac1d794

On Mon, Mar 21, 2016 at 10:26 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Mon, Mar 21, 2016 at 10:21 AM, Andres Freund <andres@anarazel.de> wrote:

On March 21, 2016 5:12:38 AM GMT+01:00, Amit Kapila <amit.kapila16@gmail.com> wrote:

The article pointed by you justifies that the way ResetEvent is done by
patch is correct. I am not sure, but you can weigh, if there is a need of
comment so that if we want enhance this part of code (or want to write
something similar) in future, we don't need to rediscover this fact.

I've added a reference in a comment.

Did you have a chance of running the patched versions on windows?

I am planning to do it in next few hours.

With 0002-Introduce-new-WaitEventSet-API, initdb is successful, but server
startup leads to the problem below:

LOG: database system was shut down at 2016-03-21 11:17:13 IST
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
FATAL: failed to set up event for socket: error code 10022
LOG: statistics collector process (PID 668) exited with exit code 1
LOG: autovacuum launcher process (PID 3868) was terminated by exception
0xC0000005
HINT: See C include file "ntstatus.h" for a description of the hexadecimal
value.

I haven't investigated the problem yet.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#87Andres Freund
andres@anarazel.de
In reply to: Amit Kapila (#86)
2 attachment(s)
Re: Performance degradation in commit ac1d794

Hi,

On 2016-03-21 11:26:43 +0530, Amit Kapila wrote:

LOG: database system was shut down at 2016-03-21 11:17:13 IST
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
FATAL: failed to set up event for socket: error code 10022
LOG: statistics collector process (PID 668) exited with exit code 1
LOG: autovacuum launcher process (PID 3868) was terminated by exception
0xC0000005
HINT: See C include file "ntstatus.h" for a description of the hexadecimal
value.

We resolved this and two followup issues on IM; all localized issues.
Thanks again for the help!

I've attached two refreshed patches including the relevant fixes, and the
addition of a 'user_data' pointer to events, as desired by Thomas.

I plan to push these after some msvc animals turn green. We might want
to iterate on the API in some parts, but this will fix the regression,
and imo looks good.

Regards,

Andres

Attachments:

0001-Combine-win32-and-unix-latch-implementations.patch (text/x-patch; charset=us-ascii)
From 72e2d21c1249b674496f97cd6009c0bda62f6b4d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 21 Mar 2016 09:56:39 +0100
Subject: [PATCH 1/2] Combine win32 and unix latch implementations.

Previously latches for windows and unix had been implemented in
different files. A later patch introduces an expanded wait
infrastructure; keeping the implementation separate would introduce too
much duplication.

This basically just moves the functions, without too much change. The
reason to keep this separate is that it allows blame to continue working
a little less badly; and to make review a tiny bit easier.

Discussion: 20160114143931.GG10941@awork2.anarazel.de
---
 configure                                          |  10 +-
 configure.in                                       |   8 -
 src/backend/Makefile                               |   3 +-
 src/backend/port/.gitignore                        |   1 -
 src/backend/port/Makefile                          |   2 +-
 src/backend/port/win32_latch.c                     | 349 ---------------------
 src/backend/storage/ipc/Makefile                   |   5 +-
 .../{port/unix_latch.c => storage/ipc/latch.c}     | 282 ++++++++++++++++-
 src/include/storage/latch.h                        |   2 +-
 src/tools/msvc/Mkvcbuild.pm                        |   2 -
 10 files changed, 279 insertions(+), 385 deletions(-)
 delete mode 100644 src/backend/port/win32_latch.c
 rename src/backend/{port/unix_latch.c => storage/ipc/latch.c} (74%)

diff --git a/configure b/configure
index a45be67..c10d954 100755
--- a/configure
+++ b/configure
@@ -14786,13 +14786,6 @@ $as_echo "#define USE_WIN32_SHARED_MEMORY 1" >>confdefs.h
   SHMEM_IMPLEMENTATION="src/backend/port/win32_shmem.c"
 fi
 
-# Select latch implementation type.
-if test "$PORTNAME" != "win32"; then
-  LATCH_IMPLEMENTATION="src/backend/port/unix_latch.c"
-else
-  LATCH_IMPLEMENTATION="src/backend/port/win32_latch.c"
-fi
-
 # If not set in template file, set bytes to use libc memset()
 if test x"$MEMSET_LOOP_LIMIT" = x"" ; then
   MEMSET_LOOP_LIMIT=1024
@@ -15868,7 +15861,7 @@ fi
 ac_config_files="$ac_config_files GNUmakefile src/Makefile.global"
 
 
-ac_config_links="$ac_config_links src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION} src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION} src/backend/port/pg_latch.c:${LATCH_IMPLEMENTATION} src/include/dynloader.h:src/backend/port/dynloader/${template}.h src/include/pg_config_os.h:src/include/port/${template}.h src/Makefile.port:src/makefiles/Makefile.${template}"
+ac_config_links="$ac_config_links src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION} src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION} src/include/dynloader.h:src/backend/port/dynloader/${template}.h src/include/pg_config_os.h:src/include/port/${template}.h src/Makefile.port:src/makefiles/Makefile.${template}"
 
 
 if test "$PORTNAME" = "win32"; then
@@ -16592,7 +16585,6 @@ do
     "src/backend/port/dynloader.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c" ;;
     "src/backend/port/pg_sema.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION}" ;;
     "src/backend/port/pg_shmem.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION}" ;;
-    "src/backend/port/pg_latch.c") CONFIG_LINKS="$CONFIG_LINKS src/backend/port/pg_latch.c:${LATCH_IMPLEMENTATION}" ;;
     "src/include/dynloader.h") CONFIG_LINKS="$CONFIG_LINKS src/include/dynloader.h:src/backend/port/dynloader/${template}.h" ;;
     "src/include/pg_config_os.h") CONFIG_LINKS="$CONFIG_LINKS src/include/pg_config_os.h:src/include/port/${template}.h" ;;
     "src/Makefile.port") CONFIG_LINKS="$CONFIG_LINKS src/Makefile.port:src/makefiles/Makefile.${template}" ;;
diff --git a/configure.in b/configure.in
index c298926..47d0f58 100644
--- a/configure.in
+++ b/configure.in
@@ -1976,13 +1976,6 @@ else
   SHMEM_IMPLEMENTATION="src/backend/port/win32_shmem.c"
 fi
 
-# Select latch implementation type.
-if test "$PORTNAME" != "win32"; then
-  LATCH_IMPLEMENTATION="src/backend/port/unix_latch.c"
-else
-  LATCH_IMPLEMENTATION="src/backend/port/win32_latch.c"
-fi
-
 # If not set in template file, set bytes to use libc memset()
 if test x"$MEMSET_LOOP_LIMIT" = x"" ; then
   MEMSET_LOOP_LIMIT=1024
@@ -2178,7 +2171,6 @@ AC_CONFIG_LINKS([
   src/backend/port/dynloader.c:src/backend/port/dynloader/${template}.c
   src/backend/port/pg_sema.c:${SEMA_IMPLEMENTATION}
   src/backend/port/pg_shmem.c:${SHMEM_IMPLEMENTATION}
-  src/backend/port/pg_latch.c:${LATCH_IMPLEMENTATION}
   src/include/dynloader.h:src/backend/port/dynloader/${template}.h
   src/include/pg_config_os.h:src/include/port/${template}.h
   src/Makefile.port:src/makefiles/Makefile.${template}
diff --git a/src/backend/Makefile b/src/backend/Makefile
index b3d5e2e..d22dbbf 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -306,8 +306,7 @@ ifeq ($(PORTNAME), win32)
 endif
 
 distclean: clean
-	rm -f port/tas.s port/dynloader.c port/pg_sema.c port/pg_shmem.c \
-	      port/pg_latch.c
+	rm -f port/tas.s port/dynloader.c port/pg_sema.c port/pg_shmem.c
 
 maintainer-clean: distclean
 	rm -f bootstrap/bootparse.c \
diff --git a/src/backend/port/.gitignore b/src/backend/port/.gitignore
index 7d3ac4a..9f4f1af 100644
--- a/src/backend/port/.gitignore
+++ b/src/backend/port/.gitignore
@@ -1,5 +1,4 @@
 /dynloader.c
-/pg_latch.c
 /pg_sema.c
 /pg_shmem.c
 /tas.s
diff --git a/src/backend/port/Makefile b/src/backend/port/Makefile
index c6b1d20..89549d0 100644
--- a/src/backend/port/Makefile
+++ b/src/backend/port/Makefile
@@ -21,7 +21,7 @@ subdir = src/backend/port
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = atomics.o dynloader.o pg_sema.o pg_shmem.o pg_latch.o $(TAS)
+OBJS = atomics.o dynloader.o pg_sema.o pg_shmem.o $(TAS)
 
 ifeq ($(PORTNAME), darwin)
 SUBDIRS += darwin
diff --git a/src/backend/port/win32_latch.c b/src/backend/port/win32_latch.c
deleted file mode 100644
index bbf1b24..0000000
--- a/src/backend/port/win32_latch.c
+++ /dev/null
@@ -1,349 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * win32_latch.c
- *	  Routines for inter-process latches
- *
- * See unix_latch.c for header comments for the exported functions;
- * the API presented here is supposed to be the same as there.
- *
- * The Windows implementation uses Windows events that are inherited by
- * all postmaster child processes.
- *
- * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
- * Portions Copyright (c) 1994, Regents of the University of California
- *
- * IDENTIFICATION
- *	  src/backend/port/win32_latch.c
- *
- *-------------------------------------------------------------------------
- */
-#include "postgres.h"
-
-#include <fcntl.h>
-#include <limits.h>
-#include <signal.h>
-#include <unistd.h>
-
-#include "miscadmin.h"
-#include "portability/instr_time.h"
-#include "postmaster/postmaster.h"
-#include "storage/barrier.h"
-#include "storage/latch.h"
-#include "storage/pmsignal.h"
-#include "storage/shmem.h"
-
-
-void
-InitializeLatchSupport(void)
-{
-	/* currently, nothing to do here for Windows */
-}
-
-void
-InitLatch(volatile Latch *latch)
-{
-	latch->is_set = false;
-	latch->owner_pid = MyProcPid;
-	latch->is_shared = false;
-
-	latch->event = CreateEvent(NULL, TRUE, FALSE, NULL);
-	if (latch->event == NULL)
-		elog(ERROR, "CreateEvent failed: error code %lu", GetLastError());
-}
-
-void
-InitSharedLatch(volatile Latch *latch)
-{
-	SECURITY_ATTRIBUTES sa;
-
-	latch->is_set = false;
-	latch->owner_pid = 0;
-	latch->is_shared = true;
-
-	/*
-	 * Set up security attributes to specify that the events are inherited.
-	 */
-	ZeroMemory(&sa, sizeof(sa));
-	sa.nLength = sizeof(sa);
-	sa.bInheritHandle = TRUE;
-
-	latch->event = CreateEvent(&sa, TRUE, FALSE, NULL);
-	if (latch->event == NULL)
-		elog(ERROR, "CreateEvent failed: error code %lu", GetLastError());
-}
-
-void
-OwnLatch(volatile Latch *latch)
-{
-	/* Sanity checks */
-	Assert(latch->is_shared);
-	if (latch->owner_pid != 0)
-		elog(ERROR, "latch already owned");
-
-	latch->owner_pid = MyProcPid;
-}
-
-void
-DisownLatch(volatile Latch *latch)
-{
-	Assert(latch->is_shared);
-	Assert(latch->owner_pid == MyProcPid);
-
-	latch->owner_pid = 0;
-}
-
-int
-WaitLatch(volatile Latch *latch, int wakeEvents, long timeout)
-{
-	return WaitLatchOrSocket(latch, wakeEvents, PGINVALID_SOCKET, timeout);
-}
-
-int
-WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
-				  long timeout)
-{
-	DWORD		rc;
-	instr_time	start_time,
-				cur_time;
-	long		cur_timeout;
-	HANDLE		events[4];
-	HANDLE		latchevent;
-	HANDLE		sockevent = WSA_INVALID_EVENT;
-	int			numevents;
-	int			result = 0;
-	int			pmdeath_eventno = 0;
-
-	Assert(wakeEvents != 0);	/* must have at least one wake event */
-
-	/* waiting for socket readiness without a socket indicates a bug */
-	if (sock == PGINVALID_SOCKET &&
-		(wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
-		elog(ERROR, "cannot wait on socket event without a socket");
-
-	if ((wakeEvents & WL_LATCH_SET) && latch->owner_pid != MyProcPid)
-		elog(ERROR, "cannot wait on a latch owned by another process");
-
-	/*
-	 * Initialize timeout if requested.  We must record the current time so
-	 * that we can determine the remaining timeout if WaitForMultipleObjects
-	 * is interrupted.
-	 */
-	if (wakeEvents & WL_TIMEOUT)
-	{
-		INSTR_TIME_SET_CURRENT(start_time);
-		Assert(timeout >= 0 && timeout <= INT_MAX);
-		cur_timeout = timeout;
-	}
-	else
-		cur_timeout = INFINITE;
-
-	/*
-	 * Construct an array of event handles for WaitforMultipleObjects().
-	 *
-	 * Note: pgwin32_signal_event should be first to ensure that it will be
-	 * reported when multiple events are set.  We want to guarantee that
-	 * pending signals are serviced.
-	 */
-	latchevent = latch->event;
-
-	events[0] = pgwin32_signal_event;
-	events[1] = latchevent;
-	numevents = 2;
-	if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
-	{
-		/* Need an event object to represent events on the socket */
-		int			flags = FD_CLOSE;	/* always check for errors/EOF */
-
-		if (wakeEvents & WL_SOCKET_READABLE)
-			flags |= FD_READ;
-		if (wakeEvents & WL_SOCKET_WRITEABLE)
-			flags |= FD_WRITE;
-
-		sockevent = WSACreateEvent();
-		if (sockevent == WSA_INVALID_EVENT)
-			elog(ERROR, "failed to create event for socket: error code %u",
-				 WSAGetLastError());
-		if (WSAEventSelect(sock, sockevent, flags) != 0)
-			elog(ERROR, "failed to set up event for socket: error code %u",
-				 WSAGetLastError());
-
-		events[numevents++] = sockevent;
-	}
-	if (wakeEvents & WL_POSTMASTER_DEATH)
-	{
-		pmdeath_eventno = numevents;
-		events[numevents++] = PostmasterHandle;
-	}
-
-	/* Ensure that signals are serviced even if latch is already set */
-	pgwin32_dispatch_queued_signals();
-
-	do
-	{
-		/*
-		 * The comment in unix_latch.c's equivalent to this applies here as
-		 * well. At least after mentally replacing self-pipe with windows
-		 * event. There's no danger of overflowing, as "Setting an event that
-		 * is already set has no effect.".
-		 */
-		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
-		{
-			result |= WL_LATCH_SET;
-
-			/*
-			 * Leave loop immediately, avoid blocking again. We don't attempt
-			 * to report any other events that might also be satisfied.
-			 */
-			break;
-		}
-
-		rc = WaitForMultipleObjects(numevents, events, FALSE, cur_timeout);
-
-		if (rc == WAIT_FAILED)
-			elog(ERROR, "WaitForMultipleObjects() failed: error code %lu",
-				 GetLastError());
-		else if (rc == WAIT_TIMEOUT)
-		{
-			result |= WL_TIMEOUT;
-		}
-		else if (rc == WAIT_OBJECT_0)
-		{
-			/* Service newly-arrived signals */
-			pgwin32_dispatch_queued_signals();
-		}
-		else if (rc == WAIT_OBJECT_0 + 1)
-		{
-			/*
-			 * Reset the event.  We'll re-check the, potentially, set latch on
-			 * next iteration of loop, but let's not waste the cycles to
-			 * update cur_timeout below.
-			 */
-			if (!ResetEvent(latchevent))
-				elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());
-
-			continue;
-		}
-		else if ((wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) &&
-				 rc == WAIT_OBJECT_0 + 2)		/* socket is at event slot 2 */
-		{
-			WSANETWORKEVENTS resEvents;
-
-			ZeroMemory(&resEvents, sizeof(resEvents));
-			if (WSAEnumNetworkEvents(sock, sockevent, &resEvents) != 0)
-				elog(ERROR, "failed to enumerate network events: error code %u",
-					 WSAGetLastError());
-			if ((wakeEvents & WL_SOCKET_READABLE) &&
-				(resEvents.lNetworkEvents & FD_READ))
-			{
-				result |= WL_SOCKET_READABLE;
-			}
-			if ((wakeEvents & WL_SOCKET_WRITEABLE) &&
-				(resEvents.lNetworkEvents & FD_WRITE))
-			{
-				result |= WL_SOCKET_WRITEABLE;
-			}
-			if (resEvents.lNetworkEvents & FD_CLOSE)
-			{
-				if (wakeEvents & WL_SOCKET_READABLE)
-					result |= WL_SOCKET_READABLE;
-				if (wakeEvents & WL_SOCKET_WRITEABLE)
-					result |= WL_SOCKET_WRITEABLE;
-			}
-		}
-		else if ((wakeEvents & WL_POSTMASTER_DEATH) &&
-				 rc == WAIT_OBJECT_0 + pmdeath_eventno)
-		{
-			/*
-			 * Postmaster apparently died.  Since the consequences of falsely
-			 * returning WL_POSTMASTER_DEATH could be pretty unpleasant, we
-			 * take the trouble to positively verify this with
-			 * PostmasterIsAlive(), even though there is no known reason to
-			 * think that the event could be falsely set on Windows.
-			 */
-			if (!PostmasterIsAlive())
-				result |= WL_POSTMASTER_DEATH;
-		}
-		else
-			elog(ERROR, "unexpected return code from WaitForMultipleObjects(): %lu", rc);
-
-		/* If we're not done, update cur_timeout for next iteration */
-		if (result == 0 && (wakeEvents & WL_TIMEOUT))
-		{
-			INSTR_TIME_SET_CURRENT(cur_time);
-			INSTR_TIME_SUBTRACT(cur_time, start_time);
-			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
-			if (cur_timeout <= 0)
-			{
-				/* Timeout has expired, no need to continue looping */
-				result |= WL_TIMEOUT;
-			}
-		}
-	} while (result == 0);
-
-	/* Clean up the event object we created for the socket */
-	if (sockevent != WSA_INVALID_EVENT)
-	{
-		WSAEventSelect(sock, NULL, 0);
-		WSACloseEvent(sockevent);
-	}
-
-	return result;
-}
-
-/*
- * The comments above the unix implementation (unix_latch.c) of this function
- * apply here as well.
- */
-void
-SetLatch(volatile Latch *latch)
-{
-	HANDLE		handle;
-
-	/*
-	 * The memory barrier has be to be placed here to ensure that any flag
-	 * variables possibly changed by this process have been flushed to main
-	 * memory, before we check/set is_set.
-	 */
-	pg_memory_barrier();
-
-	/* Quick exit if already set */
-	if (latch->is_set)
-		return;
-
-	latch->is_set = true;
-
-	/*
-	 * See if anyone's waiting for the latch. It can be the current process if
-	 * we're in a signal handler.
-	 *
-	 * Use a local variable here just in case somebody changes the event field
-	 * concurrently (which really should not happen).
-	 */
-	handle = latch->event;
-	if (handle)
-	{
-		SetEvent(handle);
-
-		/*
-		 * Note that we silently ignore any errors. We might be in a signal
-		 * handler or other critical path where it's not safe to call elog().
-		 */
-	}
-}
-
-void
-ResetLatch(volatile Latch *latch)
-{
-	/* Only the owner should reset the latch */
-	Assert(latch->owner_pid == MyProcPid);
-
-	latch->is_set = false;
-
-	/*
-	 * Ensure that the write to is_set gets flushed to main memory before we
-	 * examine any flag variables.  Otherwise a concurrent SetLatch might
-	 * falsely conclude that it needn't signal us, even though we have missed
-	 * seeing some flag updates that SetLatch was supposed to inform us of.
-	 */
-	pg_memory_barrier();
-}
diff --git a/src/backend/storage/ipc/Makefile b/src/backend/storage/ipc/Makefile
index d8eb742..8a55392 100644
--- a/src/backend/storage/ipc/Makefile
+++ b/src/backend/storage/ipc/Makefile
@@ -8,7 +8,8 @@ subdir = src/backend/storage/ipc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = dsm_impl.o dsm.o ipc.o ipci.o pmsignal.o procarray.o procsignal.o \
-	shmem.o shmqueue.o shm_mq.o shm_toc.o sinval.o sinvaladt.o standby.o
+OBJS = dsm_impl.o dsm.o ipc.o ipci.o latch.o pmsignal.o procarray.o \
+	procsignal.o  shmem.o shmqueue.o shm_mq.o shm_toc.o sinval.o \
+	sinvaladt.o standby.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/port/unix_latch.c b/src/backend/storage/ipc/latch.c
similarity index 74%
rename from src/backend/port/unix_latch.c
rename to src/backend/storage/ipc/latch.c
index 63b76c6..d42c9c6 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1,6 +1,6 @@
 /*-------------------------------------------------------------------------
  *
- * unix_latch.c
+ * latch.c
  *	  Routines for inter-process latches
  *
  * The Unix implementation uses the so-called self-pipe trick to overcome
@@ -22,11 +22,14 @@
  * process, SIGUSR1 is sent and the signal handler in the waiting process
  * writes the byte to the pipe on behalf of the signaling process.
  *
+ * The Windows implementation uses Windows events that are inherited by
+ * all postmaster child processes.
+ *
  * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  * IDENTIFICATION
- *	  src/backend/port/unix_latch.c
+ *	  src/backend/storage/ipc/latch.c
  *
  *-------------------------------------------------------------------------
  */
@@ -62,16 +65,20 @@
  * useful to manually specify the used primitive.  If desired, just add a
  * define somewhere before this block.
  */
-#if defined(LATCH_USE_POLL) || defined(LATCH_USE_SELECT)
+#if defined(LATCH_USE_POLL) || defined(LATCH_USE_SELECT) \
+	|| defined(LATCH_USE_WIN32)
 /* don't overwrite manual choice */
 #elif defined(HAVE_POLL)
 #define LATCH_USE_POLL
 #elif HAVE_SYS_SELECT_H
 #define LATCH_USE_SELECT
+#elif WIN32
+#define LATCH_USE_WIN32
 #else
 #error "no latch implementation available"
 #endif
 
+#ifndef WIN32
 /* Are we currently in WaitLatch? The signal handler would like to know. */
 static volatile sig_atomic_t waiting = false;
 
@@ -82,6 +89,7 @@ static int	selfpipe_writefd = -1;
 /* Private function prototypes */
 static void sendSelfPipeByte(void);
 static void drainSelfPipe(void);
+#endif   /* WIN32 */
 
 
 /*
@@ -93,6 +101,7 @@ static void drainSelfPipe(void);
 void
 InitializeLatchSupport(void)
 {
+#ifndef WIN32
 	int			pipefd[2];
 
 	Assert(selfpipe_readfd == -1);
@@ -113,6 +122,9 @@ InitializeLatchSupport(void)
 
 	selfpipe_readfd = pipefd[0];
 	selfpipe_writefd = pipefd[1];
+#else
+	/* currently, nothing to do here for Windows */
+#endif
 }
 
 /*
@@ -121,12 +133,18 @@ InitializeLatchSupport(void)
 void
 InitLatch(volatile Latch *latch)
 {
-	/* Assert InitializeLatchSupport has been called in this process */
-	Assert(selfpipe_readfd >= 0);
-
 	latch->is_set = false;
 	latch->owner_pid = MyProcPid;
 	latch->is_shared = false;
+
+#ifndef WIN32
+	/* Assert InitializeLatchSupport has been called in this process */
+	Assert(selfpipe_readfd >= 0);
+#else
+	latch->event = CreateEvent(NULL, TRUE, FALSE, NULL);
+	if (latch->event == NULL)
+		elog(ERROR, "CreateEvent failed: error code %lu", GetLastError());
+#endif   /* WIN32 */
 }
 
 /*
@@ -143,6 +161,21 @@ InitLatch(volatile Latch *latch)
 void
 InitSharedLatch(volatile Latch *latch)
 {
+#ifdef WIN32
+	SECURITY_ATTRIBUTES sa;
+
+	/*
+	 * Set up security attributes to specify that the events are inherited.
+	 */
+	ZeroMemory(&sa, sizeof(sa));
+	sa.nLength = sizeof(sa);
+	sa.bInheritHandle = TRUE;
+
+	latch->event = CreateEvent(&sa, TRUE, FALSE, NULL);
+	if (latch->event == NULL)
+		elog(ERROR, "CreateEvent failed: error code %lu", GetLastError());
+#endif
+
 	latch->is_set = false;
 	latch->owner_pid = 0;
 	latch->is_shared = true;
@@ -164,12 +197,14 @@ InitSharedLatch(volatile Latch *latch)
 void
 OwnLatch(volatile Latch *latch)
 {
-	/* Assert InitializeLatchSupport has been called in this process */
-	Assert(selfpipe_readfd >= 0);
-
+	/* Sanity checks */
 	Assert(latch->is_shared);
 
-	/* sanity check */
+#ifndef WIN32
+	/* Assert InitializeLatchSupport has been called in this process */
+	Assert(selfpipe_readfd >= 0);
+#endif
+
 	if (latch->owner_pid != 0)
 		elog(ERROR, "latch already owned");
 
@@ -221,6 +256,7 @@ WaitLatch(volatile Latch *latch, int wakeEvents, long timeout)
  * returning the socket as readable/writable or both, depending on
  * WL_SOCKET_READABLE/WL_SOCKET_WRITEABLE being specified.
  */
+#ifndef LATCH_USE_WIN32
 int
 WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 				  long timeout)
@@ -551,6 +587,199 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 
 	return result;
 }
+#else							/* LATCH_USE_WIN32 */
+int
+WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
+				  long timeout)
+{
+	DWORD		rc;
+	instr_time	start_time,
+				cur_time;
+	long		cur_timeout;
+	HANDLE		events[4];
+	HANDLE		latchevent;
+	HANDLE		sockevent = WSA_INVALID_EVENT;
+	int			numevents;
+	int			result = 0;
+	int			pmdeath_eventno = 0;
+
+	Assert(wakeEvents != 0);	/* must have at least one wake event */
+
+	/* waiting for socket readiness without a socket indicates a bug */
+	if (sock == PGINVALID_SOCKET &&
+		(wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+		elog(ERROR, "cannot wait on socket event without a socket");
+
+	if ((wakeEvents & WL_LATCH_SET) && latch->owner_pid != MyProcPid)
+		elog(ERROR, "cannot wait on a latch owned by another process");
+
+	/*
+	 * Initialize timeout if requested.  We must record the current time so
+	 * that we can determine the remaining timeout if WaitForMultipleObjects
+	 * is interrupted.
+	 */
+	if (wakeEvents & WL_TIMEOUT)
+	{
+		INSTR_TIME_SET_CURRENT(start_time);
+		Assert(timeout >= 0 && timeout <= INT_MAX);
+		cur_timeout = timeout;
+	}
+	else
+		cur_timeout = INFINITE;
+
+	/*
+	 * Construct an array of event handles for WaitforMultipleObjects().
+	 *
+	 * Note: pgwin32_signal_event should be first to ensure that it will be
+	 * reported when multiple events are set.  We want to guarantee that
+	 * pending signals are serviced.
+	 */
+	latchevent = latch->event;
+
+	events[0] = pgwin32_signal_event;
+	events[1] = latchevent;
+	numevents = 2;
+	if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+	{
+		/* Need an event object to represent events on the socket */
+		int			flags = FD_CLOSE;	/* always check for errors/EOF */
+
+		if (wakeEvents & WL_SOCKET_READABLE)
+			flags |= FD_READ;
+		if (wakeEvents & WL_SOCKET_WRITEABLE)
+			flags |= FD_WRITE;
+
+		sockevent = WSACreateEvent();
+		if (sockevent == WSA_INVALID_EVENT)
+			elog(ERROR, "failed to create event for socket: error code %u",
+				 WSAGetLastError());
+		if (WSAEventSelect(sock, sockevent, flags) != 0)
+			elog(ERROR, "failed to set up event for socket: error code %u",
+				 WSAGetLastError());
+
+		events[numevents++] = sockevent;
+	}
+	if (wakeEvents & WL_POSTMASTER_DEATH)
+	{
+		pmdeath_eventno = numevents;
+		events[numevents++] = PostmasterHandle;
+	}
+
+	/* Ensure that signals are serviced even if latch is already set */
+	pgwin32_dispatch_queued_signals();
+
+	do
+	{
+		/*
+		 * The comment in the unix version above applies here as well. At
+		 * least after mentally replacing self-pipe with windows event.
+		 * There's no danger of overflowing, as "Setting an event that is
+		 * already set has no effect.".
+		 */
+		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
+		{
+			result |= WL_LATCH_SET;
+
+			/*
+			 * Leave loop immediately, avoid blocking again. We don't attempt
+			 * to report any other events that might also be satisfied.
+			 */
+			break;
+		}
+
+		rc = WaitForMultipleObjects(numevents, events, FALSE, cur_timeout);
+
+		if (rc == WAIT_FAILED)
+			elog(ERROR, "WaitForMultipleObjects() failed: error code %lu",
+				 GetLastError());
+		else if (rc == WAIT_TIMEOUT)
+		{
+			result |= WL_TIMEOUT;
+		}
+		else if (rc == WAIT_OBJECT_0)
+		{
+			/* Service newly-arrived signals */
+			pgwin32_dispatch_queued_signals();
+		}
+		else if (rc == WAIT_OBJECT_0 + 1)
+		{
+			/*
+			 * Reset the event.  We'll re-check the, potentially, set latch on
+			 * next iteration of loop, but let's not waste the cycles to
+			 * update cur_timeout below.
+			 */
+			if (!ResetEvent(latchevent))
+				elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());
+
+			continue;
+		}
+		else if ((wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) &&
+				 rc == WAIT_OBJECT_0 + 2)		/* socket is at event slot 2 */
+		{
+			WSANETWORKEVENTS resEvents;
+
+			ZeroMemory(&resEvents, sizeof(resEvents));
+			if (WSAEnumNetworkEvents(sock, sockevent, &resEvents) != 0)
+				elog(ERROR, "failed to enumerate network events: error code %u",
+					 WSAGetLastError());
+			if ((wakeEvents & WL_SOCKET_READABLE) &&
+				(resEvents.lNetworkEvents & FD_READ))
+			{
+				result |= WL_SOCKET_READABLE;
+			}
+			if ((wakeEvents & WL_SOCKET_WRITEABLE) &&
+				(resEvents.lNetworkEvents & FD_WRITE))
+			{
+				result |= WL_SOCKET_WRITEABLE;
+			}
+			if (resEvents.lNetworkEvents & FD_CLOSE)
+			{
+				if (wakeEvents & WL_SOCKET_READABLE)
+					result |= WL_SOCKET_READABLE;
+				if (wakeEvents & WL_SOCKET_WRITEABLE)
+					result |= WL_SOCKET_WRITEABLE;
+			}
+		}
+		else if ((wakeEvents & WL_POSTMASTER_DEATH) &&
+				 rc == WAIT_OBJECT_0 + pmdeath_eventno)
+		{
+			/*
+			 * Postmaster apparently died.  Since the consequences of falsely
+			 * returning WL_POSTMASTER_DEATH could be pretty unpleasant, we
+			 * take the trouble to positively verify this with
+			 * PostmasterIsAlive(), even though there is no known reason to
+			 * think that the event could be falsely set on Windows.
+			 */
+			if (!PostmasterIsAlive())
+				result |= WL_POSTMASTER_DEATH;
+		}
+		else
+			elog(ERROR, "unexpected return code from WaitForMultipleObjects(): %lu", rc);
+
+		/* If we're not done, update cur_timeout for next iteration */
+		if (result == 0 && (wakeEvents & WL_TIMEOUT))
+		{
+			INSTR_TIME_SET_CURRENT(cur_time);
+			INSTR_TIME_SUBTRACT(cur_time, start_time);
+			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+			if (cur_timeout <= 0)
+			{
+				/* Timeout has expired, no need to continue looping */
+				result |= WL_TIMEOUT;
+			}
+		}
+	} while (result == 0);
+
+	/* Clean up the event object we created for the socket */
+	if (sockevent != WSA_INVALID_EVENT)
+	{
+		WSAEventSelect(sock, NULL, 0);
+		WSACloseEvent(sockevent);
+	}
+
+	return result;
+}
+#endif   /* LATCH_USE_WIN32 */
 
 /*
  * Sets a latch and wakes up anyone waiting on it.
@@ -567,7 +796,11 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 void
 SetLatch(volatile Latch *latch)
 {
+#ifndef WIN32
 	pid_t		owner_pid;
+#else
+	HANDLE		handle;
+#endif
 
 	/*
 	 * The memory barrier has be to be placed here to ensure that any flag
@@ -582,6 +815,8 @@ SetLatch(volatile Latch *latch)
 
 	latch->is_set = true;
 
+#ifndef WIN32
+
 	/*
 	 * See if anyone's waiting for the latch. It can be the current process if
 	 * we're in a signal handler. We use the self-pipe to wake up the select()
@@ -613,6 +848,27 @@ SetLatch(volatile Latch *latch)
 	}
 	else
 		kill(owner_pid, SIGUSR1);
+#else
+
+	/*
+	 * See if anyone's waiting for the latch. It can be the current process if
+	 * we're in a signal handler.
+	 *
+	 * Use a local variable here just in case somebody changes the event field
+	 * concurrently (which really should not happen).
+	 */
+	handle = latch->event;
+	if (handle)
+	{
+		SetEvent(handle);
+
+		/*
+		 * Note that we silently ignore any errors. We might be in a signal
+		 * handler or other critical path where it's not safe to call elog().
+		 */
+	}
+#endif
+
 }
 
 /*
@@ -646,14 +902,17 @@ ResetLatch(volatile Latch *latch)
  * NB: when calling this in a signal handler, be sure to save and restore
  * errno around it.
  */
+#ifndef WIN32
 void
 latch_sigusr1_handler(void)
 {
 	if (waiting)
 		sendSelfPipeByte();
 }
+#endif   /* !WIN32 */
 
 /* Send one byte to the self-pipe, to wake up WaitLatch */
+#ifndef WIN32
 static void
 sendSelfPipeByte(void)
 {
@@ -683,6 +942,7 @@ retry:
 		return;
 	}
 }
+#endif   /* !WIN32 */
 
 /*
  * Read all available data from the self-pipe
@@ -691,6 +951,7 @@ retry:
  * return, it must reset that flag first (though ideally, this will never
  * happen).
  */
+#ifndef WIN32
 static void
 drainSelfPipe(void)
 {
@@ -729,3 +990,4 @@ drainSelfPipe(void)
 		/* else buffer wasn't big enough, so read again */
 	}
 }
+#endif   /* !WIN32 */
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 737e11d..1b9521f 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -36,7 +36,7 @@
  * WaitLatch includes a provision for timeouts (which should be avoided
  * when possible, as they incur extra overhead) and a provision for
  * postmaster child processes to wake up immediately on postmaster death.
- * See unix_latch.c for detailed specifications for the exported functions.
+ * See latch.c for detailed specifications for the exported functions.
  *
  * The correct pattern to wait for event(s) is:
  *
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index ab65fa3..012b327 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -136,8 +136,6 @@ sub mkvcbuild
 		'src/backend/port/win32_sema.c');
 	$postgres->ReplaceFile('src/backend/port/pg_shmem.c',
 		'src/backend/port/win32_shmem.c');
-	$postgres->ReplaceFile('src/backend/port/pg_latch.c',
-		'src/backend/port/win32_latch.c');
 	$postgres->AddFiles('src/port',   @pgportfiles);
 	$postgres->AddFiles('src/common', @pgcommonbkndfiles);
 	$postgres->AddDir('src/timezone');
-- 
2.7.0.229.g701fa7f

Attachment: 0002-Introduce-WaitEventSet-API.patch (text/x-patch; charset=us-ascii)
From af5fbbaab953e0b41ec4e661a3cdabd7666473c8 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 21 Mar 2016 09:56:39 +0100
Subject: [PATCH 2/2] Introduce WaitEventSet API.

Commit ac1d794 ("Make idle backends exit if the postmaster dies.")
introduced a regression on, at least, large linux systems. Constantly
adding the same postmaster_alive_fds to the OSs internal datastructures
for implementing poll/select can cause significant contention; leading
to a performance regression of nearly 3x in one example.

This can be avoided by using e.g. Linux's epoll, which avoids having to
add/remove file descriptors to the wait data structures at a high rate.
Unfortunately the current latch interface makes it hard to allocate any
persistent per-backend resources.
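
To make the contention mechanism concrete, here is a minimal sketch
(illustrative only, not code from this patch; the fd names are invented)
of the per-call registration that poll() forces versus a persistent
epoll set:

    #include <poll.h>
    #include <sys/epoll.h>

    /* every wait hands both fds to the kernel again */
    static void
    wait_with_poll(int sock_fd, int pm_pipe_fd)
    {
        struct pollfd pfds[2];

        pfds[0].fd = sock_fd;
        pfds[0].events = POLLIN;
        pfds[1].fd = pm_pipe_fd;
        pfds[1].events = POLLIN;
        (void) poll(pfds, 2, -1);
    }

    /* fds are registered once; later waits just block */
    static void
    wait_with_epoll(int sock_fd, int pm_pipe_fd)
    {
        static int  epfd = -1;
        struct epoll_event ev = { .events = EPOLLIN };
        struct epoll_event ret;

        if (epfd == -1)
        {
            epfd = epoll_create(2);
            epoll_ctl(epfd, EPOLL_CTL_ADD, sock_fd, &ev);
            epoll_ctl(epfd, EPOLL_CTL_ADD, pm_pipe_fd, &ev);
        }
        (void) epoll_wait(epfd, &ret, 1, -1);
    }

With poll()/select(), every backend re-adds the shared postmaster pipe to
the kernel's wait structures on each call; with epoll that registration
happens once per backend.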

Replace, with a backward compatibility layer, WaitLatchOrSocket with a
new WaitEventSet API. Users can allocate such a set once and reuse it
across multiple wait calls, and can add more than one file descriptor to
wait on. The latter has been added because there are upcoming postgres
features where that will be helpful.

In addition to the previously existing poll(2), select(2), and
WaitForMultipleObjects() implementations, also provide an epoll_wait(2)
based implementation to address the aforementioned performance
problem. Epoll is only available on Linux, but that is the most likely
OS for machines large enough (four sockets) to reproduce the problem.

To actually address the aforementioned regression, create and use a
long-lived WaitEventSet for FE/BE communication.  There are additional
places that would benefit from a long-lived set, but that's a task for
another day.
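
In condensed form, and based on how this patch itself uses the API in
pqcomm.c and be-secure.c (only a sketch, details elided), the intended
usage pattern looks like this:

    WaitEventSet *set;
    WaitEvent     event;
    int           sock_pos;

    /* set up once, e.g. at backend startup */
    set = CreateWaitEventSet(TopMemoryContext, 3);
    sock_pos = AddWaitEventToSet(set, WL_SOCKET_READABLE, MyProcPort->sock,
                                 NULL, NULL);
    AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
    AddWaitEventToSet(set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);

    /* each subsequent wait only adjusts the mask and blocks */
    ModifyWaitEvent(set, sock_pos, WL_SOCKET_WRITEABLE, NULL);
    if (WaitEventSetWait(set, -1 /* no timeout */, &event, 1) == 1 &&
        (event.events & WL_POSTMASTER_DEATH))
        proc_exit(1);

With the epoll implementation each ModifyWaitEvent() is at most an
EPOLL_CTL_MOD on the already registered descriptor, which is what removes
the per-wait registration cost.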

Thanks to Amit Kapila, who helped make the windows code I blindly wrote
actually work.

Reported-By: Dmitry Vasilyev
Discussion: CAB-SwXZh44_2ybvS5Z67p_CDz=XFn4hNAD=CnMEF+QqkXwFrGg@mail.gmail.com
    20160114143931.GG10941@awork2.anarazel.de
---
 configure                         |    2 +-
 configure.in                      |    2 +-
 src/backend/libpq/be-secure.c     |   24 +-
 src/backend/libpq/pqcomm.c        |    5 +
 src/backend/storage/ipc/latch.c   | 1614 +++++++++++++++++++++++++------------
 src/backend/utils/init/miscinit.c |    8 +
 src/include/libpq/libpq.h         |    3 +
 src/include/pg_config.h.in        |    3 +
 src/include/storage/latch.h       |   37 +-
 src/tools/pgindent/typedefs.list  |    2 +
 10 files changed, 1170 insertions(+), 530 deletions(-)

diff --git a/configure b/configure
index c10d954..24655dc 100755
--- a/configure
+++ b/configure
@@ -10193,7 +10193,7 @@ fi
 ## Header files
 ##
 
-for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
+for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/epoll.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
 do :
   as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh`
 ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default"
diff --git a/configure.in b/configure.in
index 47d0f58..c564a76 100644
--- a/configure.in
+++ b/configure.in
@@ -1183,7 +1183,7 @@ AC_SUBST(UUID_LIBS)
 ##
 
 dnl sys/socket.h is required by AC_FUNC_ACCEPT_ARGTYPES
-AC_CHECK_HEADERS([atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h])
+AC_CHECK_HEADERS([atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/epoll.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h])
 
 # On BSD, test for net/if.h will fail unless sys/socket.h
 # is included first.
diff --git a/src/backend/libpq/be-secure.c b/src/backend/libpq/be-secure.c
index ac709d1..29297e7 100644
--- a/src/backend/libpq/be-secure.c
+++ b/src/backend/libpq/be-secure.c
@@ -140,13 +140,13 @@ retry:
 	/* In blocking mode, wait until the socket is ready */
 	if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))
 	{
-		int			w;
+		WaitEvent   event;
 
 		Assert(waitfor);
 
-		w = WaitLatchOrSocket(MyLatch,
-							  WL_LATCH_SET | WL_POSTMASTER_DEATH | waitfor,
-							  port->sock, 0);
+		ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);
+
+		WaitEventSetWait(FeBeWaitSet, -1 /* no timeout */, &event, 1);
 
 		/*
 		 * If the postmaster has died, it's not safe to continue running,
@@ -165,13 +165,13 @@ retry:
 		 * cycles checking for this very rare condition, and this should cause
 		 * us to exit quickly in most cases.)
 		 */
-		if (w & WL_POSTMASTER_DEATH)
+		if (event.events & WL_POSTMASTER_DEATH)
 			ereport(FATAL,
 					(errcode(ERRCODE_ADMIN_SHUTDOWN),
 					errmsg("terminating connection due to unexpected postmaster exit")));
 
 		/* Handle interrupt. */
-		if (w & WL_LATCH_SET)
+		if (event.events & WL_LATCH_SET)
 		{
 			ResetLatch(MyLatch);
 			ProcessClientReadInterrupt(true);
@@ -241,22 +241,22 @@ retry:
 
 	if (n < 0 && !port->noblock && (errno == EWOULDBLOCK || errno == EAGAIN))
 	{
-		int			w;
+		WaitEvent   event;
 
 		Assert(waitfor);
 
-		w = WaitLatchOrSocket(MyLatch,
-							  WL_LATCH_SET | WL_POSTMASTER_DEATH | waitfor,
-							  port->sock, 0);
+		ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);
+
+		WaitEventSetWait(FeBeWaitSet, -1 /* no timeout */, &event, 1);
 
 		/* See comments in secure_read. */
-		if (w & WL_POSTMASTER_DEATH)
+		if (event.events & WL_POSTMASTER_DEATH)
 			ereport(FATAL,
 					(errcode(ERRCODE_ADMIN_SHUTDOWN),
 					errmsg("terminating connection due to unexpected postmaster exit")));
 
 		/* Handle interrupt. */
-		if (w & WL_LATCH_SET)
+		if (event.events & WL_LATCH_SET)
 		{
 			ResetLatch(MyLatch);
 			ProcessClientWriteInterrupt(true);
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 71473db..acd005e 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,6 +201,11 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
+					  NULL, NULL);
+	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
+	AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL, NULL);
 }
 
 /* --------------------------------
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index d42c9c6..c33e325 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -14,8 +14,8 @@
  * however reliably interrupts the sleep, and causes select() to return
  * immediately even if the signal arrives before select() begins.
  *
- * (Actually, we prefer poll() over select() where available, but the
- * same comments apply to it.)
+ * (Actually, we prefer epoll_wait() over poll() over select() where
+ * available, but the same comments apply.)
  *
  * When SetLatch is called from the same process that owns the latch,
  * SetLatch writes the byte directly to the pipe. If it's owned by another
@@ -41,6 +41,9 @@
 #include <unistd.h>
 #include <sys/time.h>
 #include <sys/types.h>
+#ifdef HAVE_SYS_EPOLL_H
+#include <sys/epoll.h>
+#endif
 #ifdef HAVE_POLL_H
 #include <poll.h>
 #endif
@@ -65,19 +68,60 @@
  * useful to manually specify the used primitive.  If desired, just add a
  * define somewhere before this block.
  */
-#if defined(LATCH_USE_POLL) || defined(LATCH_USE_SELECT) \
-	|| defined(LATCH_USE_WIN32)
+#if defined(WAIT_USE_EPOLL) || defined(WAIT_USE_POLL) || \
+	defined(WAIT_USE_SELECT) || defined(WAIT_USE_WIN32)
 /* don't overwrite manual choice */
+#elif defined(HAVE_SYS_EPOLL_H)
+#define WAIT_USE_EPOLL
 #elif defined(HAVE_POLL)
-#define LATCH_USE_POLL
+#define WAIT_USE_POLL
 #elif HAVE_SYS_SELECT_H
-#define LATCH_USE_SELECT
+#define WAIT_USE_SELECT
 #elif WIN32
-#define LATCH_USE_WIN32
+#define WAIT_USE_WIN32
 #else
-#error "no latch implementation available"
+#error "no wait set implementation available"
 #endif
 
+/* typedef in latch.h */
+struct WaitEventSet
+{
+	int			nevents;		/* number of registered events */
+	int			nevents_space;	/* maximum number of events in this set */
+
+	/*
+	 * Array, of nevents_space length, storing the definition of events this
+	 * set is waiting for.
+	 */
+	WaitEvent  *events;
+
+	/*
+	 * If WL_LATCH_SET is specified in any wait event, latch is a pointer to
+	 * said latch, and latch_pos the offset in the ->events array. This is
+	 * useful because we check the state of the latch before performing
+	 * syscalls related to waiting.
+	 */
+	Latch	   *latch;
+	int			latch_pos;
+
+#if defined(WAIT_USE_EPOLL)
+	int			epoll_fd;
+	/* epoll_wait returns events in a user-provided array, allocate once */
+	struct epoll_event *epoll_ret_events;
+#elif defined(WAIT_USE_POLL)
+	/* poll() is handed the pollfd array on every call, so prepare it once */
+	struct pollfd *pollfds;
+#elif defined(WAIT_USE_WIN32)
+
+	/*
+	 * Array of windows events. The first element always contains
+	 * pgwin32_signal_event, so the remaining elements are offset by one (i.e.
+	 * event->pos + 1).
+	 */
+	HANDLE	   *handles;
+#endif
+};
+
 #ifndef WIN32
 /* Are we currently in WaitLatch? The signal handler would like to know. */
 static volatile sig_atomic_t waiting = false;
@@ -91,6 +135,16 @@ static void sendSelfPipeByte(void);
 static void drainSelfPipe(void);
 #endif   /* WIN32 */
 
+#if defined(WAIT_USE_EPOLL)
+static void WaitEventAdjustEpoll(WaitEventSet *set, WaitEvent *event, int action);
+#elif defined(WAIT_USE_POLL)
+static void WaitEventAdjustPoll(WaitEventSet *set, WaitEvent *event);
+#elif defined(WAIT_USE_WIN32)
+static void WaitEventAdjustWin32(WaitEventSet *set, WaitEvent *event);
+#endif
+
+static int WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+					  WaitEvent *occurred_events, int nevents);
 
 /*
  * Initialize the process-local latch infrastructure.
@@ -255,531 +309,57 @@ WaitLatch(volatile Latch *latch, int wakeEvents, long timeout)
  * When waiting on a socket, EOF and error conditions are reported by
  * returning the socket as readable/writable or both, depending on
  * WL_SOCKET_READABLE/WL_SOCKET_WRITEABLE being specified.
+ *
+ * NB: These days this is just a wrapper around the WaitEventSet API. When
+ * using a latch very frequently, consider creating a longer living
+ * WaitEventSet instead; that's more efficient.
  */
-#ifndef LATCH_USE_WIN32
 int
 WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 				  long timeout)
 {
-	int			result = 0;
+	int			ret = 0;
 	int			rc;
-	instr_time	start_time,
-				cur_time;
-	long		cur_timeout;
+	WaitEvent	event;
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
 
-#if defined(LATCH_USE_POLL)
-	struct pollfd pfds[3];
-	int			nfds;
-#elif defined(LATCH_USE_SELECT)
-	struct timeval tv,
-			   *tvp;
-	fd_set		input_mask;
-	fd_set		output_mask;
-	int			hifd;
-#endif
-
-	Assert(wakeEvents != 0);	/* must have at least one wake event */
-
-	/* waiting for socket readiness without a socket indicates a bug */
-	if (sock == PGINVALID_SOCKET &&
-		(wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
-		elog(ERROR, "cannot wait on socket event without a socket");
-
-	if ((wakeEvents & WL_LATCH_SET) && latch->owner_pid != MyProcPid)
-		elog(ERROR, "cannot wait on a latch owned by another process");
-
-	/*
-	 * Initialize timeout if requested.  We must record the current time so
-	 * that we can determine the remaining timeout if the poll() or select()
-	 * is interrupted.  (On some platforms, select() will update the contents
-	 * of "tv" for us, but unfortunately we can't rely on that.)
-	 */
 	if (wakeEvents & WL_TIMEOUT)
-	{
-		INSTR_TIME_SET_CURRENT(start_time);
-		Assert(timeout >= 0 && timeout <= INT_MAX);
-		cur_timeout = timeout;
-
-#ifdef LATCH_USE_SELECT
-		tv.tv_sec = cur_timeout / 1000L;
-		tv.tv_usec = (cur_timeout % 1000L) * 1000L;
-		tvp = &tv;
-#endif
-	}
+		Assert(timeout >= 0);
 	else
-	{
-		cur_timeout = -1;
+		timeout = -1;
 
-#ifdef LATCH_USE_SELECT
-		tvp = NULL;
-#endif
-	}
+	if (wakeEvents & WL_LATCH_SET)
+		AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET,
+						  (Latch *) latch, NULL);
 
-	waiting = true;
-	do
-	{
-		/*
-		 * Check if the latch is set already. If so, leave loop immediately,
-		 * avoid blocking again. We don't attempt to report any other events
-		 * that might also be satisfied.
-		 *
-		 * If someone sets the latch between this and the poll()/select()
-		 * below, the setter will write a byte to the pipe (or signal us and
-		 * the signal handler will do that), and the poll()/select() will
-		 * return immediately.
-		 *
-		 * If there's a pending byte in the self pipe, we'll notice whenever
-		 * blocking. Only clearing the pipe in that case avoids having to
-		 * drain it every time WaitLatchOrSocket() is used. Should the
-		 * pipe-buffer fill up we're still ok, because the pipe is in
-		 * nonblocking mode. It's unlikely for that to happen, because the
-		 * self pipe isn't filled unless we're blocking (waiting = true), or
-		 * from inside a signal handler in latch_sigusr1_handler().
-		 *
-		 * Note: we assume that the kernel calls involved in drainSelfPipe()
-		 * and SetLatch() will provide adequate synchronization on machines
-		 * with weak memory ordering, so that we cannot miss seeing is_set if
-		 * the signal byte is already in the pipe when we drain it.
-		 */
-		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
-		{
-			result |= WL_LATCH_SET;
-			break;
-		}
+	if (wakeEvents & WL_POSTMASTER_DEATH)
+		AddWaitEventToSet(set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET,
+						  NULL, NULL);
 
-		/*
-		 * Must wait ... we use the polling interface determined at the top of
-		 * this file to do so.
-		 */
-#if defined(LATCH_USE_POLL)
-		nfds = 0;
-
-		/* selfpipe is always in pfds[0] */
-		pfds[0].fd = selfpipe_readfd;
-		pfds[0].events = POLLIN;
-		pfds[0].revents = 0;
-		nfds++;
-
-		if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
-		{
-			/* socket, if used, is always in pfds[1] */
-			pfds[1].fd = sock;
-			pfds[1].events = 0;
-			if (wakeEvents & WL_SOCKET_READABLE)
-				pfds[1].events |= POLLIN;
-			if (wakeEvents & WL_SOCKET_WRITEABLE)
-				pfds[1].events |= POLLOUT;
-			pfds[1].revents = 0;
-			nfds++;
-		}
-
-		if (wakeEvents & WL_POSTMASTER_DEATH)
-		{
-			/* postmaster fd, if used, is always in pfds[nfds - 1] */
-			pfds[nfds].fd = postmaster_alive_fds[POSTMASTER_FD_WATCH];
-			pfds[nfds].events = POLLIN;
-			pfds[nfds].revents = 0;
-			nfds++;
-		}
-
-		/* Sleep */
-		rc = poll(pfds, nfds, (int) cur_timeout);
-
-		/* Check return code */
-		if (rc < 0)
-		{
-			/* EINTR is okay, otherwise complain */
-			if (errno != EINTR)
-			{
-				waiting = false;
-				ereport(ERROR,
-						(errcode_for_socket_access(),
-						 errmsg("poll() failed: %m")));
-			}
-		}
-		else if (rc == 0)
-		{
-			/* timeout exceeded */
-			if (wakeEvents & WL_TIMEOUT)
-				result |= WL_TIMEOUT;
-		}
-		else
-		{
-			/* at least one event occurred, so check revents values */
-
-			if (pfds[0].revents & POLLIN)
-			{
-				/* There's data in the self-pipe, clear it. */
-				drainSelfPipe();
-			}
-
-			if ((wakeEvents & WL_SOCKET_READABLE) &&
-				(pfds[1].revents & POLLIN))
-			{
-				/* data available in socket, or EOF/error condition */
-				result |= WL_SOCKET_READABLE;
-			}
-			if ((wakeEvents & WL_SOCKET_WRITEABLE) &&
-				(pfds[1].revents & POLLOUT))
-			{
-				/* socket is writable */
-				result |= WL_SOCKET_WRITEABLE;
-			}
-			if ((wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) &&
-				(pfds[1].revents & (POLLHUP | POLLERR | POLLNVAL)))
-			{
-				/* EOF/error condition */
-				if (wakeEvents & WL_SOCKET_READABLE)
-					result |= WL_SOCKET_READABLE;
-				if (wakeEvents & WL_SOCKET_WRITEABLE)
-					result |= WL_SOCKET_WRITEABLE;
-			}
-
-			/*
-			 * We expect a POLLHUP when the remote end is closed, but because
-			 * we don't expect the pipe to become readable or to have any
-			 * errors either, treat those cases as postmaster death, too.
-			 */
-			if ((wakeEvents & WL_POSTMASTER_DEATH) &&
-				(pfds[nfds - 1].revents & (POLLHUP | POLLIN | POLLERR | POLLNVAL)))
-			{
-				/*
-				 * According to the select(2) man page on Linux, select(2) may
-				 * spuriously return and report a file descriptor as readable,
-				 * when it's not; and presumably so can poll(2).  It's not
-				 * clear that the relevant cases would ever apply to the
-				 * postmaster pipe, but since the consequences of falsely
-				 * returning WL_POSTMASTER_DEATH could be pretty unpleasant,
-				 * we take the trouble to positively verify EOF with
-				 * PostmasterIsAlive().
-				 */
-				if (!PostmasterIsAlive())
-					result |= WL_POSTMASTER_DEATH;
-			}
-		}
-#elif defined(LATCH_USE_SELECT)
-
-		/*
-		 * On at least older linux kernels select(), in violation of POSIX,
-		 * doesn't reliably return a socket as writable if closed - but we
-		 * rely on that. So far all the known cases of this problem are on
-		 * platforms that also provide a poll() implementation without that
-		 * bug.  If we find one where that's not the case, we'll need to add a
-		 * workaround.
-		 */
-		FD_ZERO(&input_mask);
-		FD_ZERO(&output_mask);
-
-		FD_SET(selfpipe_readfd, &input_mask);
-		hifd = selfpipe_readfd;
-
-		if (wakeEvents & WL_POSTMASTER_DEATH)
-		{
-			FD_SET(postmaster_alive_fds[POSTMASTER_FD_WATCH], &input_mask);
-			if (postmaster_alive_fds[POSTMASTER_FD_WATCH] > hifd)
-				hifd = postmaster_alive_fds[POSTMASTER_FD_WATCH];
-		}
-
-		if (wakeEvents & WL_SOCKET_READABLE)
-		{
-			FD_SET(sock, &input_mask);
-			if (sock > hifd)
-				hifd = sock;
-		}
-
-		if (wakeEvents & WL_SOCKET_WRITEABLE)
-		{
-			FD_SET(sock, &output_mask);
-			if (sock > hifd)
-				hifd = sock;
-		}
-
-		/* Sleep */
-		rc = select(hifd + 1, &input_mask, &output_mask, NULL, tvp);
-
-		/* Check return code */
-		if (rc < 0)
-		{
-			/* EINTR is okay, otherwise complain */
-			if (errno != EINTR)
-			{
-				waiting = false;
-				ereport(ERROR,
-						(errcode_for_socket_access(),
-						 errmsg("select() failed: %m")));
-			}
-		}
-		else if (rc == 0)
-		{
-			/* timeout exceeded */
-			if (wakeEvents & WL_TIMEOUT)
-				result |= WL_TIMEOUT;
-		}
-		else
-		{
-			/* at least one event occurred, so check masks */
-			if (FD_ISSET(selfpipe_readfd, &input_mask))
-			{
-				/* There's data in the self-pipe, clear it. */
-				drainSelfPipe();
-			}
-			if ((wakeEvents & WL_SOCKET_READABLE) && FD_ISSET(sock, &input_mask))
-			{
-				/* data available in socket, or EOF */
-				result |= WL_SOCKET_READABLE;
-			}
-			if ((wakeEvents & WL_SOCKET_WRITEABLE) && FD_ISSET(sock, &output_mask))
-			{
-				/* socket is writable, or EOF */
-				result |= WL_SOCKET_WRITEABLE;
-			}
-			if ((wakeEvents & WL_POSTMASTER_DEATH) &&
-				FD_ISSET(postmaster_alive_fds[POSTMASTER_FD_WATCH],
-						 &input_mask))
-			{
-				/*
-				 * According to the select(2) man page on Linux, select(2) may
-				 * spuriously return and report a file descriptor as readable,
-				 * when it's not; and presumably so can poll(2).  It's not
-				 * clear that the relevant cases would ever apply to the
-				 * postmaster pipe, but since the consequences of falsely
-				 * returning WL_POSTMASTER_DEATH could be pretty unpleasant,
-				 * we take the trouble to positively verify EOF with
-				 * PostmasterIsAlive().
-				 */
-				if (!PostmasterIsAlive())
-					result |= WL_POSTMASTER_DEATH;
-			}
-		}
-#endif   /* LATCH_USE_SELECT */
-
-		/*
-		 * Check again whether latch is set, the arrival of a signal/self-byte
-		 * might be what stopped our sleep. It's not required for correctness
-		 * to signal the latch as being set (we'd just loop if there's no
-		 * other event), but it seems good to report an arrived latch asap.
-		 * This way we also don't have to compute the current timestamp again.
-		 */
-		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
-			result |= WL_LATCH_SET;
-
-		/* If we're not done, update cur_timeout for next iteration */
-		if (result == 0 && (wakeEvents & WL_TIMEOUT))
-		{
-			INSTR_TIME_SET_CURRENT(cur_time);
-			INSTR_TIME_SUBTRACT(cur_time, start_time);
-			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
-			if (cur_timeout <= 0)
-			{
-				/* Timeout has expired, no need to continue looping */
-				result |= WL_TIMEOUT;
-			}
-#ifdef LATCH_USE_SELECT
-			else
-			{
-				tv.tv_sec = cur_timeout / 1000L;
-				tv.tv_usec = (cur_timeout % 1000L) * 1000L;
-			}
-#endif
-		}
-	} while (result == 0);
-	waiting = false;
-
-	return result;
-}
-#else							/* LATCH_USE_WIN32 */
-int
-WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
-				  long timeout)
-{
-	DWORD		rc;
-	instr_time	start_time,
-				cur_time;
-	long		cur_timeout;
-	HANDLE		events[4];
-	HANDLE		latchevent;
-	HANDLE		sockevent = WSA_INVALID_EVENT;
-	int			numevents;
-	int			result = 0;
-	int			pmdeath_eventno = 0;
-
-	Assert(wakeEvents != 0);	/* must have at least one wake event */
-
-	/* waiting for socket readiness without a socket indicates a bug */
-	if (sock == PGINVALID_SOCKET &&
-		(wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
-		elog(ERROR, "cannot wait on socket event without a socket");
-
-	if ((wakeEvents & WL_LATCH_SET) && latch->owner_pid != MyProcPid)
-		elog(ERROR, "cannot wait on a latch owned by another process");
-
-	/*
-	 * Initialize timeout if requested.  We must record the current time so
-	 * that we can determine the remaining timeout if WaitForMultipleObjects
-	 * is interrupted.
-	 */
-	if (wakeEvents & WL_TIMEOUT)
-	{
-		INSTR_TIME_SET_CURRENT(start_time);
-		Assert(timeout >= 0 && timeout <= INT_MAX);
-		cur_timeout = timeout;
-	}
-	else
-		cur_timeout = INFINITE;
-
-	/*
-	 * Construct an array of event handles for WaitforMultipleObjects().
-	 *
-	 * Note: pgwin32_signal_event should be first to ensure that it will be
-	 * reported when multiple events are set.  We want to guarantee that
-	 * pending signals are serviced.
-	 */
-	latchevent = latch->event;
-
-	events[0] = pgwin32_signal_event;
-	events[1] = latchevent;
-	numevents = 2;
 	if (wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
 	{
-		/* Need an event object to represent events on the socket */
-		int			flags = FD_CLOSE;	/* always check for errors/EOF */
+		int			ev;
 
-		if (wakeEvents & WL_SOCKET_READABLE)
-			flags |= FD_READ;
-		if (wakeEvents & WL_SOCKET_WRITEABLE)
-			flags |= FD_WRITE;
-
-		sockevent = WSACreateEvent();
-		if (sockevent == WSA_INVALID_EVENT)
-			elog(ERROR, "failed to create event for socket: error code %u",
-				 WSAGetLastError());
-		if (WSAEventSelect(sock, sockevent, flags) != 0)
-			elog(ERROR, "failed to set up event for socket: error code %u",
-				 WSAGetLastError());
-
-		events[numevents++] = sockevent;
-	}
-	if (wakeEvents & WL_POSTMASTER_DEATH)
-	{
-		pmdeath_eventno = numevents;
-		events[numevents++] = PostmasterHandle;
+		ev = wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);
+		AddWaitEventToSet(set, ev, sock, NULL, NULL);
 	}
 
-	/* Ensure that signals are serviced even if latch is already set */
-	pgwin32_dispatch_queued_signals();
+	rc = WaitEventSetWait(set, timeout, &event, 1);
 
-	do
+	if (rc == 0)
+		ret |= WL_TIMEOUT;
+	else
 	{
-		/*
-		 * The comment in the unix version above applies here as well. At
-		 * least after mentally replacing self-pipe with windows event.
-		 * There's no danger of overflowing, as "Setting an event that is
-		 * already set has no effect.".
-		 */
-		if ((wakeEvents & WL_LATCH_SET) && latch->is_set)
-		{
-			result |= WL_LATCH_SET;
-
-			/*
-			 * Leave loop immediately, avoid blocking again. We don't attempt
-			 * to report any other events that might also be satisfied.
-			 */
-			break;
-		}
-
-		rc = WaitForMultipleObjects(numevents, events, FALSE, cur_timeout);
-
-		if (rc == WAIT_FAILED)
-			elog(ERROR, "WaitForMultipleObjects() failed: error code %lu",
-				 GetLastError());
-		else if (rc == WAIT_TIMEOUT)
-		{
-			result |= WL_TIMEOUT;
-		}
-		else if (rc == WAIT_OBJECT_0)
-		{
-			/* Service newly-arrived signals */
-			pgwin32_dispatch_queued_signals();
-		}
-		else if (rc == WAIT_OBJECT_0 + 1)
-		{
-			/*
-			 * Reset the event.  We'll re-check the, potentially, set latch on
-			 * next iteration of loop, but let's not waste the cycles to
-			 * update cur_timeout below.
-			 */
-			if (!ResetEvent(latchevent))
-				elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());
-
-			continue;
-		}
-		else if ((wakeEvents & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) &&
-				 rc == WAIT_OBJECT_0 + 2)		/* socket is at event slot 2 */
-		{
-			WSANETWORKEVENTS resEvents;
-
-			ZeroMemory(&resEvents, sizeof(resEvents));
-			if (WSAEnumNetworkEvents(sock, sockevent, &resEvents) != 0)
-				elog(ERROR, "failed to enumerate network events: error code %u",
-					 WSAGetLastError());
-			if ((wakeEvents & WL_SOCKET_READABLE) &&
-				(resEvents.lNetworkEvents & FD_READ))
-			{
-				result |= WL_SOCKET_READABLE;
-			}
-			if ((wakeEvents & WL_SOCKET_WRITEABLE) &&
-				(resEvents.lNetworkEvents & FD_WRITE))
-			{
-				result |= WL_SOCKET_WRITEABLE;
-			}
-			if (resEvents.lNetworkEvents & FD_CLOSE)
-			{
-				if (wakeEvents & WL_SOCKET_READABLE)
-					result |= WL_SOCKET_READABLE;
-				if (wakeEvents & WL_SOCKET_WRITEABLE)
-					result |= WL_SOCKET_WRITEABLE;
-			}
-		}
-		else if ((wakeEvents & WL_POSTMASTER_DEATH) &&
-				 rc == WAIT_OBJECT_0 + pmdeath_eventno)
-		{
-			/*
-			 * Postmaster apparently died.  Since the consequences of falsely
-			 * returning WL_POSTMASTER_DEATH could be pretty unpleasant, we
-			 * take the trouble to positively verify this with
-			 * PostmasterIsAlive(), even though there is no known reason to
-			 * think that the event could be falsely set on Windows.
-			 */
-			if (!PostmasterIsAlive())
-				result |= WL_POSTMASTER_DEATH;
-		}
-		else
-			elog(ERROR, "unexpected return code from WaitForMultipleObjects(): %lu", rc);
-
-		/* If we're not done, update cur_timeout for next iteration */
-		if (result == 0 && (wakeEvents & WL_TIMEOUT))
-		{
-			INSTR_TIME_SET_CURRENT(cur_time);
-			INSTR_TIME_SUBTRACT(cur_time, start_time);
-			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
-			if (cur_timeout <= 0)
-			{
-				/* Timeout has expired, no need to continue looping */
-				result |= WL_TIMEOUT;
-			}
-		}
-	} while (result == 0);
-
-	/* Clean up the event object we created for the socket */
-	if (sockevent != WSA_INVALID_EVENT)
-	{
-		WSAEventSelect(sock, NULL, 0);
-		WSACloseEvent(sockevent);
+		ret |= event.events & (WL_LATCH_SET |
+							   WL_POSTMASTER_DEATH |
+							   WL_SOCKET_READABLE |
+							   WL_SOCKET_WRITEABLE);
 	}
 
-	return result;
+	FreeWaitEventSet(set);
+
+	return ret;
 }
-#endif   /* LATCH_USE_WIN32 */
 
 /*
  * Sets a latch and wakes up anyone waiting on it.
@@ -893,6 +473,1018 @@ ResetLatch(volatile Latch *latch)
 }
 
 /*
+ * Create a WaitEventSet with space for nevents different events to wait for.
+ *
+ * These events can then be efficiently waited upon together, using
+ * WaitEventSetWait().
+ */
+WaitEventSet *
+CreateWaitEventSet(MemoryContext context, int nevents)
+{
+	WaitEventSet *set;
+	char	   *data;
+	Size		sz = 0;
+
+	sz += sizeof(WaitEventSet);
+	sz += sizeof(WaitEvent) * nevents;
+
+#if defined(WAIT_USE_EPOLL)
+	sz += sizeof(struct epoll_event) * nevents;
+#elif defined(WAIT_USE_POLL)
+	sz += sizeof(struct pollfd) * nevents;
+#elif defined(WAIT_USE_WIN32)
+	/* need space for the pgwin32_signal_event */
+	sz += sizeof(HANDLE) * (nevents + 1);
+#endif
+
+	data = (char *) MemoryContextAllocZero(context, sz);
+
+	set = (WaitEventSet *) data;
+	data += sizeof(WaitEventSet);
+
+	set->events = (WaitEvent *) data;
+	data += sizeof(WaitEvent) * nevents;
+
+#if defined(WAIT_USE_EPOLL)
+	set->epoll_ret_events = (struct epoll_event *) data;
+	data += sizeof(struct epoll_event) * nevents;
+#elif defined(WAIT_USE_POLL)
+	set->pollfds = (struct pollfd *) data;
+	data += sizeof(struct pollfd) * nevents;
+#elif defined(WAIT_USE_WIN32)
+	set->handles = (HANDLE *) data;
+	data += sizeof(HANDLE) * nevents;
+#endif
+
+	set->latch = NULL;
+	set->nevents_space = nevents;
+
+#if defined(WAIT_USE_EPOLL)
+	set->epoll_fd = epoll_create(nevents);
+	if (set->epoll_fd < 0)
+		elog(ERROR, "epoll_create failed: %m");
+#elif defined(WAIT_USE_WIN32)
+
+	/*
+	 * To handle signals while waiting, we need to add a win32 specific event.
+	 * We accounted for the additional event at the top of this routine. See
+	 * port/win32/signal.c for more details.
+	 *
+	 * Note: pgwin32_signal_event should be first to ensure that it will be
+	 * reported when multiple events are set.  We want to guarantee that
+	 * pending signals are serviced.
+	 */
+	set->handles[0] = pgwin32_signal_event;
+	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
+#endif
+
+	return set;
+}
+
+/*
+ * Free a previously created WaitEventSet.
+ */
+void
+FreeWaitEventSet(WaitEventSet *set)
+{
+#if defined(WAIT_USE_EPOLL)
+	close(set->epoll_fd);
+#elif defined(WAIT_USE_WIN32)
+	WaitEvent  *cur_event;
+
+	for (cur_event = set->events;
+		 cur_event < (set->events + set->nevents);
+		 cur_event++)
+	{
+		if (cur_event->events & WL_LATCH_SET)
+		{
+			/* uses the latch's HANDLE */
+		}
+		else if (cur_event->events & WL_POSTMASTER_DEATH)
+		{
+			/* uses PostmasterHandle */
+		}
+		else
+		{
+			/* Clean up the event object we created for the socket */
+			WSAEventSelect(cur_event->fd, NULL, 0);
+			WSACloseEvent(set->handles[cur_event->pos + 1]);
+		}
+	}
+#endif
+
+	pfree(set);
+}
+
+/* ---
+ * Add an event to the set. Possible events are:
+ * - WL_LATCH_SET: Wait for the latch to be set
+ * - WL_POSTMASTER_DEATH: Wait for postmaster to die
+ * - WL_SOCKET_READABLE: Wait for socket to become readable
+ *	 can be combined in one event with WL_SOCKET_WRITEABLE
+ * - WL_SOCKET_WRITEABLE: Wait for socket to become writeable
+ *	 can be combined with WL_SOCKET_READABLE
+ *
+ * Returns the offset in WaitEventSet->events (starting from 0), which can be
+ * used to modify previously added wait events using ModifyWaitEvent().
+ *
+ * In the WL_LATCH_SET case the latch must be owned by the current process,
+ * i.e. it must be a backend-local latch initialized with InitLatch, or a
+ * shared latch associated with the current process by calling OwnLatch.
+ *
+ * In the WL_SOCKET_READABLE/WRITEABLE case, EOF and error conditions are
+ * reported by returning the socket as readable/writable or both, depending on
+ * WL_SOCKET_READABLE/WRITEABLE being specified.
+ *
+ * The user_data pointer specified here will be set for the events returned
+ * by WaitEventSetWait(), making it easy to associate additional data with
+ * events.
+ */
+int
+AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd, Latch *latch,
+				  void *user_data)
+{
+	WaitEvent  *event;
+
+	/* not enough space */
+	Assert(set->nevents < set->nevents_space);
+
+	if (latch)
+	{
+		if (latch->owner_pid != MyProcPid)
+			elog(ERROR, "cannot wait on a latch owned by another process");
+		if (set->latch)
+			elog(ERROR, "cannot wait on more than one latch");
+		if ((events & WL_LATCH_SET) != WL_LATCH_SET)
+			elog(ERROR, "latch events only spuport being set");
+	}
+	else
+	{
+		if (events & WL_LATCH_SET)
+			elog(ERROR, "cannot wait on latch without a specified latch");
+	}
+
+	/* waiting for socket readiness without a socket indicates a bug */
+	if (fd == PGINVALID_SOCKET &&
+		(events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)))
+		elog(ERROR, "cannot wait on socket event without a socket");
+
+	event = &set->events[set->nevents];
+	event->pos = set->nevents++;
+	event->fd = fd;
+	event->events = events;
+	event->user_data = user_data;
+
+	if (events == WL_LATCH_SET)
+	{
+		set->latch = latch;
+		set->latch_pos = event->pos;
+#ifndef WIN32
+		event->fd = selfpipe_readfd;
+#endif
+	}
+	else if (events == WL_POSTMASTER_DEATH)
+	{
+#ifndef WIN32
+		event->fd = postmaster_alive_fds[POSTMASTER_FD_WATCH];
+#endif
+	}
+
+	/* perform wait primitive specific initialization, if needed */
+#if defined(WAIT_USE_EPOLL)
+	WaitEventAdjustEpoll(set, event, EPOLL_CTL_ADD);
+#elif defined(WAIT_USE_POLL)
+	WaitEventAdjustPoll(set, event);
+#elif defined(WAIT_USE_SELECT)
+	/* nothing to do */
+#elif defined(WAIT_USE_WIN32)
+	WaitEventAdjustWin32(set, event);
+#endif
+
+	return event->pos;
+}
+
+/*
+ * Change the event mask and, in the WL_LATCH_SET case, the latch associated
+ * with the WaitEvent.
+ *
+ * 'pos' is the id returned by AddWaitEventToSet.
+ */
+void
+ModifyWaitEvent(WaitEventSet *set, int pos, uint32 events, Latch *latch)
+{
+	WaitEvent  *event;
+
+	Assert(pos < set->nevents);
+
+	event = &set->events[pos];
+
+	/*
+	 * If neither the event mask nor the associated latch changes, return
+	 * early. That's an important optimization for some sockets, where
+	 * ModifyWaitEvent is frequently used to switch from waiting for reads to
+	 * waiting on writes.
+	 */
+	if (events == event->events &&
+		(!(event->events & WL_LATCH_SET) || set->latch == latch))
+		return;
+
+	if (event->events & WL_LATCH_SET &&
+		events != event->events)
+	{
+		/* we could allow disabling latch events for a while */
+		elog(ERROR, "cannot modify latch event");
+	}
+
+	if (event->events & WL_POSTMASTER_DEATH)
+	{
+		elog(ERROR, "cannot modify postmaster death event");
+	}
+
+	/* FIXME: validate event mask */
+	event->events = events;
+
+	if (events == WL_LATCH_SET)
+	{
+		set->latch = latch;
+	}
+
+#if defined(WAIT_USE_EPOLL)
+	WaitEventAdjustEpoll(set, event, EPOLL_CTL_MOD);
+#elif defined(WAIT_USE_POLL)
+	WaitEventAdjustPoll(set, event);
+#elif defined(WAIT_USE_SELECT)
+	/* nothing to do */
+#elif defined(WAIT_USE_WIN32)
+	WaitEventAdjustWin32(set, event);
+#endif
+}
+
+#if defined(WAIT_USE_EPOLL)
+/*
+ * action can be one of EPOLL_CTL_ADD | EPOLL_CTL_MOD | EPOLL_CTL_DEL
+ */
+static void
+WaitEventAdjustEpoll(WaitEventSet *set, WaitEvent *event, int action)
+{
+	struct epoll_event epoll_ev;
+	int			rc;
+
+	/* pointer to our event, returned by epoll_wait */
+	epoll_ev.data.ptr = event;
+	/* always wait for errors */
+	epoll_ev.events = EPOLLERR | EPOLLHUP;
+
+	/* prepare the epoll_event entry once */
+	if (event->events == WL_LATCH_SET)
+	{
+		Assert(set->latch != NULL);
+		epoll_ev.events |= EPOLLIN;
+	}
+	else if (event->events == WL_POSTMASTER_DEATH)
+	{
+		epoll_ev.events |= EPOLLIN;
+	}
+	else
+	{
+		Assert(event->fd != PGINVALID_SOCKET);
+		Assert(event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE));
+
+		if (event->events & WL_SOCKET_READABLE)
+			epoll_ev.events |= EPOLLIN;
+		if (event->events & WL_SOCKET_WRITEABLE)
+			epoll_ev.events |= EPOLLOUT;
+	}
+
+	/*
+	 * Even though unused, we also pass epoll_ev as the data argument if
+	 * EPOLL_CTL_DEL is passed as action.  There used to be an epoll bug
+	 * requiring that, and actually it makes the code simpler...
+	 */
+	rc = epoll_ctl(set->epoll_fd, action, event->fd, &epoll_ev);
+
+	if (rc < 0)
+		ereport(ERROR,
+				(errcode_for_socket_access(),
+				 errmsg("epoll_ctl() failed: %m")));
+}
+#endif
+
+#if defined(WAIT_USE_POLL)
+static void
+WaitEventAdjustPoll(WaitEventSet *set, WaitEvent *event)
+{
+	struct pollfd *pollfd = &set->pollfds[event->pos];
+
+	pollfd->revents = 0;
+	pollfd->fd = event->fd;
+
+	/* prepare pollfd entry once */
+	if (event->events == WL_LATCH_SET)
+	{
+		Assert(set->latch != NULL);
+		pollfd->events = POLLIN;
+	}
+	else if (event->events == WL_POSTMASTER_DEATH)
+	{
+		pollfd->events = POLLIN;
+	}
+	else
+	{
+		Assert(event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE));
+		pollfd->events = 0;
+		if (event->events & WL_SOCKET_READABLE)
+			pollfd->events |= POLLIN;
+		if (event->events & WL_SOCKET_WRITEABLE)
+			pollfd->events |= POLLOUT;
+	}
+
+	Assert(event->fd != PGINVALID_SOCKET);
+}
+#endif
+
+#if defined(WAIT_USE_WIN32)
+static void
+WaitEventAdjustWin32(WaitEventSet *set, WaitEvent *event)
+{
+	HANDLE	   *handle = &set->handles[event->pos + 1];
+
+	if (event->events == WL_LATCH_SET)
+	{
+		Assert(set->latch != NULL);
+		*handle = set->latch->event;
+	}
+	else if (event->events == WL_POSTMASTER_DEATH)
+	{
+		*handle = PostmasterHandle;
+	}
+	else
+	{
+		int			flags = FD_CLOSE;	/* always check for errors/EOF */
+
+		if (event->events & WL_SOCKET_READABLE)
+			flags |= FD_READ;
+		if (event->events & WL_SOCKET_WRITEABLE)
+			flags |= FD_WRITE;
+
+		if (*handle == WSA_INVALID_EVENT)
+		{
+			*handle = WSACreateEvent();
+			if (*handle == WSA_INVALID_EVENT)
+				elog(ERROR, "failed to create event for socket: error code %u",
+					 WSAGetLastError());
+		}
+		if (WSAEventSelect(event->fd, *handle, flags) != 0)
+			elog(ERROR, "failed to set up event for socket: error code %u",
+				 WSAGetLastError());
+
+		Assert(event->fd != PGINVALID_SOCKET);
+	}
+}
+#endif
+
+/*
+ * Wait for events added to the set to happen, or until the timeout is
+ * reached.  At most nevents occurred events are returned.
+ *
+ * If timeout = -1, block until an event occurs; if 0, check sockets for
+ * readiness, but don't block; if > 0, block for at most timeout milliseconds.
+ *
+ * Returns the number of events that occurred, or 0 if the timeout was reached.
+ *
+ * Returned events will have the fd, pos, user_data fields set to the
+ * values associated with the registered event.
+ */
+int
+WaitEventSetWait(WaitEventSet *set, long timeout,
+				 WaitEvent *occurred_events, int nevents)
+{
+	int			returned_events = 0;
+	instr_time	start_time;
+	instr_time	cur_time;
+	long		cur_timeout = -1;
+
+	Assert(nevents > 0);
+
+	/*
+	 * Initialize timeout if requested.  We must record the current time so
+	 * that we can determine the remaining timeout if interrupted.
+	 */
+	if (timeout >= 0)
+	{
+		INSTR_TIME_SET_CURRENT(start_time);
+		Assert(timeout >= 0 && timeout <= INT_MAX);
+		cur_timeout = timeout;
+	}
+
+#ifndef WIN32
+	waiting = true;
+#else
+	/* Ensure that signals are serviced even if latch is already set */
+	pgwin32_dispatch_queued_signals();
+#endif
+	while (returned_events == 0)
+	{
+		int			rc;
+
+		/*
+		 * Check if the latch is set already. If so, leave the loop
+		 * immediately, avoid blocking again. We don't attempt to report any
+		 * other events that might also be satisfied.
+		 *
+		 * If someone sets the latch between this and the
+		 * WaitEventSetWaitBlock() below, the setter will write a byte to the
+		 * pipe (or signal us and the signal handler will do that), and the
+		 * readiness routine will return immediately.
+		 *
+		 * On Unix, if there's a pending byte in the self pipe, we'll notice
+		 * whenever blocking. Only clearing the pipe in that case avoids
+		 * having to drain it every time WaitLatchOrSocket() is used. Should
+		 * the pipe-buffer fill up we're still ok, because the pipe is in
+		 * nonblocking mode. It's unlikely for that to happen, because the
+		 * self pipe isn't filled unless we're blocking (waiting = true), or
+		 * from inside a signal handler in latch_sigusr1_handler().
+		 *
+		 * On Windows, we'll also notice if there's a pending event for the
+		 * latch when blocking, but there's no danger of anything filling up,
+		 * as "Setting an event that is already set has no effect.".
+		 *
+		 * Note: we assume that the kernel calls involved in latch management
+		 * will provide adequate synchronization on machines with weak memory
+		 * ordering, so that we cannot miss seeing is_set if a notification
+		 * has already been queued.
+		 */
+		if (set->latch && set->latch->is_set)
+		{
+			occurred_events->fd = PGINVALID_SOCKET;
+			occurred_events->pos = set->latch_pos;
+			occurred_events->user_data =
+				set->events[set->latch_pos].user_data;
+			occurred_events->events = WL_LATCH_SET;
+			occurred_events++;
+			returned_events++;
+
+			break;
+		}
+
+		/*
+		 * Wait for events using the readiness primitive chosen at the top of
+		 * this file. If -1 is returned, a timeout has occurred; if 0, we have
+		 * to retry; everything >= 1 is the number of returned events.
+		 */
+		rc = WaitEventSetWaitBlock(set, cur_timeout,
+								   occurred_events, nevents);
+
+		if (rc == -1)
+			break;				/* timeout occurred */
+		else
+			returned_events = rc;
+
+		/* If we're not done, update cur_timeout for next iteration */
+		if (returned_events == 0 && timeout >= 0)
+		{
+			INSTR_TIME_SET_CURRENT(cur_time);
+			INSTR_TIME_SUBTRACT(cur_time, start_time);
+			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+			if (cur_timeout <= 0)
+				break;
+		}
+	}
+#ifndef WIN32
+	waiting = false;
+#endif
+
+	return returned_events;
+}
+
+
+#if defined(WAIT_USE_EPOLL)
+
+/*
+ * Wait using linux's epoll_wait(2).
+ *
+ * This is the preferable wait method, as several readiness notifications are
+ * delivered, without having to iterate through all of set->events. The
+ * returned epoll_event structs contain a pointer to our events, making
+ * association easy.
+ */
+static int
+WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+					  WaitEvent *occurred_events, int nevents)
+{
+	int			returned_events = 0;
+	int			rc;
+	WaitEvent  *cur_event;
+	struct epoll_event *cur_epoll_event;
+
+	/* Sleep */
+	rc = epoll_wait(set->epoll_fd, set->epoll_ret_events,
+					nevents, cur_timeout);
+
+	/* Check return code */
+	if (rc < 0)
+	{
+		/* EINTR is okay, otherwise complain */
+		if (errno != EINTR)
+		{
+			waiting = false;
+			ereport(ERROR,
+					(errcode_for_socket_access(),
+					 errmsg("epoll_wait() failed: %m")));
+		}
+		return 0;
+	}
+	else if (rc == 0)
+	{
+		/* timeout exceeded */
+		return -1;
+	}
+
+	/*
+	 * At least one event occurred, iterate over the returned epoll events
+	 * until they're either all processed, or we've returned all the events
+	 * the caller desired.
+	 */
+	for (cur_epoll_event = set->epoll_ret_events;
+		 cur_epoll_event < (set->epoll_ret_events + rc) &&
+		 returned_events < nevents;
+		 cur_epoll_event++)
+	{
+		/* epoll's data pointer is set to the associated WaitEvent */
+		cur_event = (WaitEvent *) cur_epoll_event->data.ptr;
+
+		occurred_events->pos = cur_event->pos;
+		occurred_events->user_data = cur_event->user_data;
+		occurred_events->events = 0;
+
+		if (cur_event->events == WL_LATCH_SET &&
+			cur_epoll_event->events & (EPOLLIN | EPOLLERR | EPOLLHUP))
+		{
+			/* There's data in the self-pipe, clear it. */
+			drainSelfPipe();
+
+			if (set->latch->is_set)
+			{
+				occurred_events->fd = PGINVALID_SOCKET;
+				occurred_events->events = WL_LATCH_SET;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+		else if (cur_event->events == WL_POSTMASTER_DEATH &&
+				 cur_epoll_event->events & (EPOLLIN | EPOLLERR | EPOLLHUP))
+		{
+			/*
+			 * We expect an EPOLLHUP when the remote end is closed, but
+			 * because we don't expect the pipe to become readable or to have
+			 * any errors either, treat those cases as postmaster death, too.
+			 *
+			 * As explained in the WAIT_USE_SELECT implementation, select(2)
+			 * may spuriously return. Be paranoid about that here too, a
+			 * spurious WL_POSTMASTER_DEATH would be painful.
+			 */
+			if (!PostmasterIsAlive())
+			{
+				occurred_events->fd = PGINVALID_SOCKET;
+				occurred_events->events = WL_POSTMASTER_DEATH;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+		else if (cur_event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+		{
+			Assert(cur_event->fd != PGINVALID_SOCKET);
+
+			if ((cur_event->events & WL_SOCKET_READABLE) &&
+				(cur_epoll_event->events & (EPOLLIN | EPOLLERR | EPOLLHUP)))
+			{
+				/* data available in socket, or EOF */
+				occurred_events->events |= WL_SOCKET_READABLE;
+			}
+
+			if ((cur_event->events & WL_SOCKET_WRITEABLE) &&
+				(cur_epoll_event->events & (EPOLLOUT | EPOLLERR | EPOLLHUP)))
+			{
+				/* writable, or EOF */
+				occurred_events->events |= WL_SOCKET_WRITEABLE;
+			}
+
+			if (occurred_events->events != 0)
+			{
+				occurred_events->fd = cur_event->fd;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+	}
+
+	return returned_events;
+}
+
+#elif defined(WAIT_USE_POLL)
+
+/*
+ * Wait using poll(2).
+ *
+ * This allows receiving readiness notifications for several events at once,
+ * but requires iterating through all of set->pollfds.
+ */
+static inline int
+WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+					  WaitEvent *occurred_events, int nevents)
+{
+	int			returned_events = 0;
+	int			rc;
+	WaitEvent  *cur_event;
+	struct pollfd *cur_pollfd;
+
+	/* Sleep */
+	rc = poll(set->pollfds, set->nevents, (int) cur_timeout);
+
+	/* Check return code */
+	if (rc < 0)
+	{
+		/* EINTR is okay, otherwise complain */
+		if (errno != EINTR)
+		{
+			waiting = false;
+			ereport(ERROR,
+					(errcode_for_socket_access(),
+					 errmsg("poll() failed: %m")));
+		}
+		return 0;
+	}
+	else if (rc == 0)
+	{
+		/* timeout exceeded */
+		return -1;
+	}
+
+	for (cur_event = set->events, cur_pollfd = set->pollfds;
+		 cur_event < (set->events + set->nevents) &&
+		 returned_events < nevents;
+		 cur_event++, cur_pollfd++)
+	{
+		/* no activity on this FD, skip */
+		if (cur_pollfd->revents == 0)
+			continue;
+
+		occurred_events->pos = cur_event->pos;
+		occurred_events->user_data = cur_event->user_data;
+		occurred_events->events = 0;
+
+		if (cur_event->events == WL_LATCH_SET &&
+			(cur_pollfd->revents & (POLLIN | POLLHUP | POLLERR | POLLNVAL)))
+		{
+			/* There's data in the self-pipe, clear it. */
+			drainSelfPipe();
+
+			if (set->latch->is_set)
+			{
+				occurred_events->fd = PGINVALID_SOCKET;
+				occurred_events->events = WL_LATCH_SET;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+		else if (cur_event->events == WL_POSTMASTER_DEATH &&
+			 (cur_pollfd->revents & (POLLIN | POLLHUP | POLLERR | POLLNVAL)))
+		{
+			/*
+			 * We expect a POLLHUP when the remote end is closed, but because
+			 * we don't expect the pipe to become readable or to have any
+			 * errors either, treat those cases as postmaster death, too.
+			 *
+			 * As explained in the WAIT_USE_SELECT implementation, select(2)
+			 * may spuriously return. Be paranoid about that here too, a
+			 * spurious WL_POSTMASTER_DEATH would be painful.
+			 */
+			if (!PostmasterIsAlive())
+			{
+				occurred_events->fd = PGINVALID_SOCKET;
+				occurred_events->events = WL_POSTMASTER_DEATH;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+		else if (cur_event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+		{
+			int			errflags = POLLHUP | POLLERR | POLLNVAL;
+
+			Assert(cur_event->fd >= PGINVALID_SOCKET);
+
+			if ((cur_event->events & WL_SOCKET_READABLE) &&
+				(cur_pollfd->revents & (POLLIN | errflags)))
+			{
+				/* data available in socket, or EOF */
+				occurred_events->events |= WL_SOCKET_READABLE;
+			}
+
+			if ((cur_event->events & WL_SOCKET_WRITEABLE) &&
+				(cur_pollfd->revents & (POLLOUT | errflags)))
+			{
+				/* writeable, or EOF */
+				occurred_events->events |= WL_SOCKET_WRITEABLE;
+			}
+
+			if (occurred_events->events != 0)
+			{
+				occurred_events->fd = cur_event->fd;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+	}
+	return returned_events;
+}
+
+#elif defined(WAIT_USE_SELECT)
+
+/*
+ * Wait using select(2).
+ *
+ * XXX: On at least older linux kernels select(), in violation of POSIX,
+ * doesn't reliably return a socket as writable if closed - but we rely on
+ * that. So far all the known cases of this problem are on platforms that also
+ * provide a poll() implementation without that bug.  If we find one where
+ * that's not the case, we'll need to add a workaround.
+ */
+static inline int
+WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+					  WaitEvent *occurred_events, int nevents)
+{
+	int			returned_events = 0;
+	int			rc;
+	WaitEvent  *cur_event;
+	fd_set		input_mask;
+	fd_set		output_mask;
+	int			hifd = -1;
+	struct timeval tv;
+	struct timeval *tvp = NULL;
+
+	FD_ZERO(&input_mask);
+	FD_ZERO(&output_mask);
+
+	/*
+	 * Prepare input/output masks. We do so every loop iteration as there's no
+	 * entirely portable way to copy fd_sets.
+	 */
+	for (cur_event = set->events;
+		 cur_event < (set->events + set->nevents);
+		 cur_event++)
+	{
+		if (cur_event->events == WL_LATCH_SET)
+			FD_SET(cur_event->fd, &input_mask);
+		else if (cur_event->events == WL_POSTMASTER_DEATH)
+			FD_SET(cur_event->fd, &input_mask);
+		else
+		{
+			Assert(cur_event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE));
+			if (cur_event->events & WL_SOCKET_READABLE)
+				FD_SET(cur_event->fd, &input_mask);
+			if (cur_event->events & WL_SOCKET_WRITEABLE)
+				FD_SET(cur_event->fd, &output_mask);
+		}
+
+		if (cur_event->fd > hifd)
+			hifd = cur_event->fd;
+	}
+
+	/* Sleep */
+	if (cur_timeout >= 0)
+	{
+		tv.tv_sec = cur_timeout / 1000L;
+		tv.tv_usec = (cur_timeout % 1000L) * 1000L;
+		tvp = &tv;
+	}
+	rc = select(hifd + 1, &input_mask, &output_mask, NULL, tvp);
+
+	/* Check return code */
+	if (rc < 0)
+	{
+		/* EINTR is okay, otherwise complain */
+		if (errno != EINTR)
+		{
+			waiting = false;
+			ereport(ERROR,
+					(errcode_for_socket_access(),
+					 errmsg("select() failed: %m")));
+		}
+		return 0;				/* retry */
+	}
+	else if (rc == 0)
+	{
+		/* timeout exceeded */
+		return -1;
+	}
+
+	/*
+	 * To associate events with select's masks, we have to check the status of
+	 * the file descriptors associated with each event, by looping through all
+	 * events.
+	 */
+	for (cur_event = set->events;
+		 cur_event < (set->events + set->nevents)
+		 && returned_events < nevents;
+		 cur_event++)
+	{
+		occurred_events->pos = cur_event->pos;
+		occurred_events->user_data = cur_event->user_data;
+		occurred_events->events = 0;
+
+		if (cur_event->events == WL_LATCH_SET &&
+			FD_ISSET(cur_event->fd, &input_mask))
+		{
+			/* There's data in the self-pipe, clear it. */
+			drainSelfPipe();
+
+			if (set->latch->is_set)
+			{
+				occurred_events->fd = PGINVALID_SOCKET;
+				occurred_events->events = WL_LATCH_SET;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+		else if (cur_event->events == WL_POSTMASTER_DEATH &&
+				 FD_ISSET(cur_event->fd, &input_mask))
+		{
+			/*
+			 * According to the select(2) man page on Linux, select(2) may
+			 * spuriously return and report a file descriptor as readable,
+			 * when it's not; and presumably so can poll(2).  It's not clear
+			 * that the relevant cases would ever apply to the postmaster
+			 * pipe, but since the consequences of falsely returning
+			 * WL_POSTMASTER_DEATH could be pretty unpleasant, we take the
+			 * trouble to positively verify EOF with PostmasterIsAlive().
+			 */
+			if (!PostmasterIsAlive())
+			{
+				occurred_events->fd = PGINVALID_SOCKET;
+				occurred_events->events = WL_POSTMASTER_DEATH;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+		else if (cur_event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+		{
+			Assert(cur_event->fd != PGINVALID_SOCKET);
+
+			if ((cur_event->events & WL_SOCKET_READABLE) &&
+				FD_ISSET(cur_event->fd, &input_mask))
+			{
+				/* data available in socket, or EOF */
+				occurred_events->events |= WL_SOCKET_READABLE;
+			}
+
+			if ((cur_event->events & WL_SOCKET_WRITEABLE) &&
+				FD_ISSET(cur_event->fd, &output_mask))
+			{
+				/* socket is writeable, or EOF */
+				occurred_events->events |= WL_SOCKET_WRITEABLE;
+			}
+
+			if (occurred_events->events != 0)
+			{
+				occurred_events->fd = cur_event->fd;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+	}
+	return returned_events;
+}
+
+#elif defined(WAIT_USE_WIN32)
+
+/*
+ * Wait using Windows' WaitForMultipleObjects().
+ *
+ * Unfortunately this will only ever return a single readiness notification at
+ * a time.  Note that while the official documentation for
+ * WaitForMultipleObjects is ambiguous about multiple events being "consumed"
+ * with a single bWaitAll = FALSE call,
+ * https://blogs.msdn.microsoft.com/oldnewthing/20150409-00/?p=44273 confirms
+ * that only one event is "consumed".
+ */
+static inline int
+WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+					  WaitEvent *occurred_events, int nevents)
+{
+	int			returned_events = 0;
+	DWORD		rc;
+	WaitEvent  *cur_event;
+
+	/*
+	 * Sleep.
+	 *
+	 * Need to wait for ->nevents + 1, because signal handle is in [0].
+	 */
+	rc = WaitForMultipleObjects(set->nevents + 1, set->handles, FALSE,
+								cur_timeout);
+
+	/* Check return code */
+	if (rc == WAIT_FAILED)
+		elog(ERROR, "WaitForMultipleObjects() failed: error code %lu",
+			 GetLastError());
+	else if (rc == WAIT_TIMEOUT)
+	{
+		/* timeout exceeded */
+		return -1;
+	}
+
+	if (rc == WAIT_OBJECT_0)
+	{
+		/* Service newly-arrived signals */
+		pgwin32_dispatch_queued_signals();
+		return 0;				/* retry */
+	}
+
+	/*
+	 * With an offset of one, due to the always present pgwin32_signal_event,
+	 * the handle offset directly corresponds to a wait event.
+	 */
+	cur_event = (WaitEvent *) &set->events[rc - WAIT_OBJECT_0 - 1];
+
+	occurred_events->pos = cur_event->pos;
+	occurred_events->user_data = cur_event->user_data;
+	occurred_events->events = 0;
+
+	if (cur_event->events == WL_LATCH_SET)
+	{
+		if (!ResetEvent(set->latch->event))
+			elog(ERROR, "ResetEvent failed: error code %lu", GetLastError());
+
+		if (set->latch->is_set)
+		{
+			occurred_events->fd = PGINVALID_SOCKET;
+			occurred_events->events = WL_LATCH_SET;
+			occurred_events++;
+			returned_events++;
+		}
+	}
+	else if (cur_event->events == WL_POSTMASTER_DEATH)
+	{
+		/*
+		 * Postmaster apparently died.  Since the consequences of falsely
+		 * returning WL_POSTMASTER_DEATH could be pretty unpleasant, we take
+		 * the trouble to positively verify this with PostmasterIsAlive(),
+		 * even though there is no known reason to think that the event could
+		 * be falsely set on Windows.
+		 */
+		if (!PostmasterIsAlive())
+		{
+			occurred_events->fd = PGINVALID_SOCKET;
+			occurred_events->events = WL_POSTMASTER_DEATH;
+			occurred_events++;
+			returned_events++;
+		}
+	}
+	else if (cur_event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+	{
+		WSANETWORKEVENTS resEvents;
+
+		Assert(cur_event->fd);
+
+		occurred_events->fd = cur_event->fd;
+
+		ZeroMemory(&resEvents, sizeof(resEvents));
+		if (WSAEnumNetworkEvents(cur_event->fd, set->handles[cur_event->pos + 1], &resEvents) != 0)
+			elog(ERROR, "failed to enumerate network events: error code %u",
+				 WSAGetLastError());
+		if ((cur_event->events & WL_SOCKET_READABLE) &&
+			(resEvents.lNetworkEvents & FD_READ))
+		{
+			/* data available in socket */
+			occurred_events->events |= WL_SOCKET_READABLE;
+		}
+		if ((cur_event->events & WL_SOCKET_WRITEABLE) &&
+			(resEvents.lNetworkEvents & FD_WRITE))
+		{
+			/* writeable */
+			occurred_events->events |= WL_SOCKET_WRITEABLE;
+		}
+		if (resEvents.lNetworkEvents & FD_CLOSE)
+		{
+			/* EOF */
+			if (cur_event->events & WL_SOCKET_READABLE)
+				occurred_events->events |= WL_SOCKET_READABLE;
+			if (cur_event->events & WL_SOCKET_WRITEABLE)
+				occurred_events->events |= WL_SOCKET_WRITEABLE;
+		}
+
+		if (occurred_events->events != 0)
+		{
+			occurred_events++;
+			returned_events++;
+		}
+	}
+
+	return returned_events;
+}
+#endif
+
+/*
  * SetLatch uses SIGUSR1 to wake up the process waiting on the latch.
  *
  * Wake up WaitLatch, if we're waiting.  (We might not be, since SIGUSR1 is
diff --git a/src/backend/utils/init/miscinit.c b/src/backend/utils/init/miscinit.c
index 18f5e6f..d13355b 100644
--- a/src/backend/utils/init/miscinit.c
+++ b/src/backend/utils/init/miscinit.c
@@ -33,6 +33,7 @@
 
 #include "access/htup_details.h"
 #include "catalog/pg_authid.h"
+#include "libpq/libpq.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "postmaster/autovacuum.h"
@@ -247,6 +248,9 @@ SwitchToSharedLatch(void)
 
 	MyLatch = &MyProc->procLatch;
 
+	if (FeBeWaitSet)
+		ModifyWaitEvent(FeBeWaitSet, 1, WL_LATCH_SET, MyLatch);
+
 	/*
 	 * Set the shared latch as the local one might have been set. This
 	 * shouldn't normally be necessary as code is supposed to check the
@@ -262,6 +266,10 @@ SwitchBackToLocalLatch(void)
 	Assert(MyProc != NULL && MyLatch == &MyProc->procLatch);
 
 	MyLatch = &LocalLatchData;
+
+	if (FeBeWaitSet)
+		ModifyWaitEvent(FeBeWaitSet, 1, WL_LATCH_SET, MyLatch);
+
 	SetLatch(MyLatch);
 }
 
diff --git a/src/include/libpq/libpq.h b/src/include/libpq/libpq.h
index 0569994..109fdf7 100644
--- a/src/include/libpq/libpq.h
+++ b/src/include/libpq/libpq.h
@@ -19,6 +19,7 @@
 
 #include "lib/stringinfo.h"
 #include "libpq/libpq-be.h"
+#include "storage/latch.h"
 
 
 typedef struct
@@ -95,6 +96,8 @@ extern ssize_t secure_raw_write(Port *port, const void *ptr, size_t len);
 
 extern bool ssl_loaded_verify_locations;
 
+WaitEventSet *FeBeWaitSet;
+
 /* GUCs */
 extern char *SSLCipherSuites;
 extern char *SSLECDHCurve;
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 3813226..c72635c 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -530,6 +530,9 @@
 /* Define to 1 if you have the syslog interface. */
 #undef HAVE_SYSLOG
 
+/* Define to 1 if you have the <sys/epoll.h> header file. */
+#undef HAVE_SYS_EPOLL_H
+
 /* Define to 1 if you have the <sys/ioctl.h> header file. */
 #undef HAVE_SYS_IOCTL_H
 
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 1b9521f..85d211c 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -68,6 +68,12 @@
  * use of any generic handler.
  *
  *
+ * WaitEventSets allow to wait for latches being set and additional events -
+ * postmaster dying and socket readiness of several sockets currently - at the
+ * same time.  On many platforms using a long lived event set is more
+ * efficient than using WaitLatch or WaitLatchOrSocket.
+ *
+ *
  * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
@@ -95,13 +101,27 @@ typedef struct Latch
 #endif
 } Latch;
 
-/* Bitmasks for events that may wake-up WaitLatch() clients */
+/*
+ * Bitmasks for events that may wake-up WaitLatch(), WaitLatchOrSocket(), or
+ * WaitEventSetWait().
+ */
 #define WL_LATCH_SET		 (1 << 0)
 #define WL_SOCKET_READABLE	 (1 << 1)
 #define WL_SOCKET_WRITEABLE  (1 << 2)
-#define WL_TIMEOUT			 (1 << 3)
+#define WL_TIMEOUT			 (1 << 3)	/* not for WaitEventSetWait() */
 #define WL_POSTMASTER_DEATH  (1 << 4)
 
+typedef struct WaitEvent
+{
+	int			pos;			/* position in the event data structure */
+	uint32		events;			/* triggered events */
+	pgsocket	fd;				/* socket fd associated with event */
+	void	   *user_data;		/* pointer provided in AddWaitEventToSet */
+} WaitEvent;
+
+/* forward declaration to avoid exposing latch.c implementation details */
+typedef struct WaitEventSet WaitEventSet;
+
 /*
  * prototypes for functions in latch.c
  */
@@ -110,12 +130,19 @@ extern void InitLatch(volatile Latch *latch);
 extern void InitSharedLatch(volatile Latch *latch);
 extern void OwnLatch(volatile Latch *latch);
 extern void DisownLatch(volatile Latch *latch);
-extern int	WaitLatch(volatile Latch *latch, int wakeEvents, long timeout);
-extern int WaitLatchOrSocket(volatile Latch *latch, int wakeEvents,
-				  pgsocket sock, long timeout);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern void FreeWaitEventSet(WaitEventSet *set);
+extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
+				  Latch *latch, void *user_data);
+extern void ModifyWaitEvent(WaitEventSet *set, int pos, uint32 events, Latch *latch);
+
+extern int	WaitEventSetWait(WaitEventSet *set, long timeout, WaitEvent *occurred_events, int nevents);
+extern int	WaitLatch(volatile Latch *latch, int wakeEvents, long timeout);
+extern int WaitLatchOrSocket(volatile Latch *latch, int wakeEvents,
+				  pgsocket sock, long timeout);
 
 /*
  * Unix implementation uses SIGUSR1 for inter-process signaling.
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b850db0..c2511de 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2113,6 +2113,8 @@ WalSnd
 WalSndCtlData
 WalSndSendDataCallback
 WalSndState
+WaitEvent
+WaitEventSet
 WholeRowVarExprState
 WindowAgg
 WindowAggState
-- 
2.7.0.229.g701fa7f

#88Andres Freund
andres@anarazel.de
In reply to: Васильев Дмитрий (#1)
Re: Performance degradation in commit ac1d794

Hi,

I've pushed the new API. We might want to use it in more places...
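
For anyone wanting to use it from another call site, a minimal caller
might look roughly like the sketch below. This is only a sketch, not an
actual call site: 'sock' stands in for whatever socket you care about,
and error handling / cleanup (FreeWaitEventSet) is omitted.

    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
    WaitEvent     ev;

    AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
    AddWaitEventToSet(set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
    AddWaitEventToSet(set, WL_SOCKET_READABLE, sock, NULL, NULL);

    for (;;)
    {
        /* timeout of -1 means wait indefinitely; ask for at most one event */
        if (WaitEventSetWait(set, -1, &ev, 1) == 0)
            continue;

        if (ev.events & WL_POSTMASTER_DEATH)
            proc_exit(1);
        if (ev.events & WL_LATCH_SET)
        {
            ResetLatch(MyLatch);
            /* re-check whatever shared state the latch protects */
        }
        if (ev.events & WL_SOCKET_READABLE)
        {
            /* read from sock */
        }
    }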

On 2015-12-25 20:08:15 +0300, Васильев Дмитрий wrote:

I suddenly found commit ac1d794 gives up to 3 times performance degradation.

I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70 CPU-core
machine:
commit ac1d794 gives me 363,474 tps
and previous commit a05dc4d gives me 956,146
and master( 3d0c50f ) with revert ac1d794 gives me 969,265

Could you please verify that the performance figures are good now? I've
tested this on a larger two-socket system, where I managed to find
a configuration showing the slowdown (-c64 -j4), and I confirmed the
fix.

Regards,

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#89Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Andres Freund (#85)
1 attachment(s)
Re: Performance degradation in commit ac1d794

On Mon, Mar 21, 2016 at 6:09 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-03-21 11:52:43 +1300, Thomas Munro wrote:

* I would be interested in writing a kqueue implementation of this for
*BSD (and MacOSX?) at some point if someone doesn't beat me to it.

I hoped that somebody would do that - that'd afaics be the only major
API missing.

Here's a first swing at it, though I need to find some spare time to
test it and figure out whether details like error conditions and EOF are
handled correctly. It could in theory minimise the number of syscalls
it makes by buffering changes (additions and, hopefully one day,
removals) and then applying all modifications to the set and beginning
the wait with a single kevent syscall, but for now it mirrors the
epoll code. One user-visible difference compared to epoll/poll/select
is that it delivers readable and writable events separately if both
conditions are true for a single fd. It builds and appears to work
correctly on FreeBSD 10.2 and MacOSX 10.10.2. I'm sharing this early
version in case any BSD users have feedback; I hope to do some
testing this weekend.
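
For the record, the batched variant would simply hand the pending
changes and the output buffer to one kevent() call, along these lines
(a sketch only, not something the attached patch does; 'sock', 'event',
'nevents' and 'timeout_p' stand in for the corresponding WaitEventSet
state and arguments, while set->kqueue_fd and set->kqueue_ret_events
are the fields added by the patch):

    struct kevent changes[2];   /* buffered EV_ADD/EV_DELETE entries */
    int           nchanges = 0;
    int           rc;

    /* queue the modifications instead of applying them immediately */
    EV_SET(&changes[nchanges++], sock, EVFILT_READ,
           EV_ADD | EV_CLEAR, 0, 0, event);
    EV_SET(&changes[nchanges++], sock, EVFILT_WRITE,
           EV_ADD | EV_CLEAR, 0, 0, event);

    /* one syscall: apply all buffered changes, then start waiting */
    rc = kevent(set->kqueue_fd, changes, nchanges,
                set->kqueue_ret_events, nevents, timeout_p);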

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

kqueue.patch (application/octet-stream)
diff --git a/configure b/configure
index 24655dc..e239641 100755
--- a/configure
+++ b/configure
@@ -10193,7 +10193,7 @@ fi
 ## Header files
 ##
 
-for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/epoll.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
+for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/epoll.h sys/event.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h
 do :
   as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh`
 ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default"
@@ -12425,7 +12425,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
+for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memmove poll pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index c564a76..ef9d450 100644
--- a/configure.in
+++ b/configure.in
@@ -1183,7 +1183,7 @@ AC_SUBST(UUID_LIBS)
 ##
 
 dnl sys/socket.h is required by AC_FUNC_ACCEPT_ARGTYPES
-AC_CHECK_HEADERS([atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/epoll.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h])
+AC_CHECK_HEADERS([atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h pwd.h sys/epoll.h sys/event.h sys/ioctl.h sys/ipc.h sys/poll.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/socket.h sys/sockio.h sys/tas.h sys/time.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h])
 
 # On BSD, test for net/if.h will fail unless sys/socket.h
 # is included first.
@@ -1432,7 +1432,7 @@ PGAC_FUNC_WCSTOMBS_L
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
+AC_CHECK_FUNCS([cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memmove poll pstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs wcstombs_l])
 
 AC_REPLACE_FUNCS(fseeko)
 case $host_os in
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 42c2f52..98b8b7f 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -44,6 +44,9 @@
 #ifdef HAVE_SYS_EPOLL_H
 #include <sys/epoll.h>
 #endif
+#ifdef HAVE_SYS_EVENT_H
+#include <sys/event.h>
+#endif
 #ifdef HAVE_POLL_H
 #include <poll.h>
 #endif
@@ -68,11 +71,14 @@
  * useful to manually specify the used primitive.  If desired, just add a
  * define somewhere before this block.
  */
-#if defined(WAIT_USE_EPOLL) || defined(WAIT_USE_POLL) || \
-	defined(WAIT_USE_SELECT) || defined(WAIT_USE_WIN32)
+#if defined(WAIT_USE_EPOLL) || defined(WAIT_USE_KQUEUE) || \
+	defined(WAIT_USE_POLL) || defined(WAIT_USE_SELECT) || \
+	defined(WAIT_USE_WIN32)
 /* don't overwrite manual choice */
 #elif defined(HAVE_SYS_EPOLL_H)
 #define WAIT_USE_EPOLL
+#elif defined(HAVE_KQUEUE)
+#define WAIT_USE_KQUEUE
 #elif defined(HAVE_POLL)
 #define WAIT_USE_POLL
 #elif HAVE_SYS_SELECT_H
@@ -108,6 +114,10 @@ struct WaitEventSet
 	int			epoll_fd;
 	/* epoll_wait returns events in a user provided arrays, allocate once */
 	struct epoll_event *epoll_ret_events;
+#elif defined(WAIT_USE_KQUEUE)
+	int			kqueue_fd;
+	/* kevent returns events in a user provided arrays, allocate once */
+	struct kevent *kqueue_ret_events;
 #elif defined(WAIT_USE_POLL)
 	/* poll expects events to be waited on every poll() call, prepare once */
 	struct pollfd *pollfds;
@@ -137,6 +147,8 @@ static void drainSelfPipe(void);
 
 #if defined(WAIT_USE_EPOLL)
 static void WaitEventAdjustEpoll(WaitEventSet *set, WaitEvent *event, int action);
+#elif defined(WAIT_USE_KQUEUE)
+static void WaitEventAdjustKqueue(WaitEventSet *set, WaitEvent *event, int action);
 #elif defined(WAIT_USE_POLL)
 static void WaitEventAdjustPoll(WaitEventSet *set, WaitEvent *event);
 #elif defined(WAIT_USE_WIN32)
@@ -490,6 +502,8 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 
 #if defined(WAIT_USE_EPOLL)
 	sz += sizeof(struct epoll_event) * nevents;
+#elif defined(WAIT_USE_KQUEUE)
+	sz += sizeof(struct kevent) * nevents;
 #elif defined(WAIT_USE_POLL)
 	sz += sizeof(struct pollfd) * nevents;
 #elif defined(WAIT_USE_WIN32)
@@ -508,6 +522,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 #if defined(WAIT_USE_EPOLL)
 	set->epoll_ret_events = (struct epoll_event *) data;
 	data += sizeof(struct epoll_event) * nevents;
+#elif defined(WAIT_USE_KQUEUE)
+	set->kqueue_ret_events = (struct kevent *) data;
+	data += sizeof(struct kevent) * nevents;
 #elif defined(WAIT_USE_POLL)
 	set->pollfds = (struct pollfd *) data;
 	data += sizeof(struct pollfd) * nevents;
@@ -523,6 +540,10 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	set->epoll_fd = epoll_create(nevents);
 	if (set->epoll_fd < 0)
 		elog(ERROR, "epoll_create failed: %m");
+#elif defined(WAIT_USE_KQUEUE)
+	set->kqueue_fd = kqueue();
+	if (set->kqueue_fd < 0)
+		elog(ERROR, "kqueue failed: %m");
 #elif defined(WAIT_USE_WIN32)
 
 	/*
@@ -549,6 +570,8 @@ FreeWaitEventSet(WaitEventSet *set)
 {
 #if defined(WAIT_USE_EPOLL)
 	close(set->epoll_fd);
+#elif defined(WAIT_USE_KQUEUE)
+	close(set->kqueue_fd);
 #elif defined(WAIT_USE_WIN32)
 	WaitEvent  *cur_event;
 
@@ -653,6 +676,8 @@ AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd, Latch *latch,
 	/* perform wait primitive specific initialization, if needed */
 #if defined(WAIT_USE_EPOLL)
 	WaitEventAdjustEpoll(set, event, EPOLL_CTL_ADD);
+#elif defined(WAIT_USE_KQUEUE)
+	WaitEventAdjustKqueue(set, event, EV_ADD);
 #elif defined(WAIT_USE_POLL)
 	WaitEventAdjustPoll(set, event);
 #elif defined(WAIT_USE_SELECT)
@@ -711,6 +736,8 @@ ModifyWaitEvent(WaitEventSet *set, int pos, uint32 events, Latch *latch)
 
 #if defined(WAIT_USE_EPOLL)
 	WaitEventAdjustEpoll(set, event, EPOLL_CTL_MOD);
+#elif defined(WAIT_USE_KQUEUE)
+	WaitEventAdjustKqueue(set, event, EV_ADD);
 #elif defined(WAIT_USE_POLL)
 	WaitEventAdjustPoll(set, event);
 #elif defined(WAIT_USE_SELECT)
@@ -803,6 +830,71 @@ WaitEventAdjustPoll(WaitEventSet *set, WaitEvent *event)
 }
 #endif
 
+#if defined(WAIT_USE_KQUEUE)
+
+/*
+ * action can be EV_ADD or EV_DELETE.  EV_ADD is used for both adding and
+ * modifying, and EV_DELETE is not used yet.
+ */
+static void
+WaitEventAdjustKqueue(WaitEventSet *set, WaitEvent *event, int action)
+{
+	int rc;
+	struct kevent k_ev[2];
+	int count = 1;
+
+	k_ev[0].ident = event->fd;
+	k_ev[0].filter = 0;
+	k_ev[0].flags = action | EV_CLEAR;
+	k_ev[0].fflags = 0;
+	k_ev[0].data = 0;
+	k_ev[0].udata = event;
+
+	Assert(event->fd >= 0);
+	if (event->events == WL_LATCH_SET)
+	{
+		Assert(set->latch != NULL);
+		k_ev[0].filter = EVFILT_READ;
+	}
+	else if (event->events == WL_POSTMASTER_DEATH)
+	{
+		k_ev[0].filter = EVFILT_READ;
+	}
+	else
+	{
+		Assert(event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE));
+
+		/*
+		 * If only one of read and write is requested, we need only one kevent
+		 * object.
+		 */
+		if (event->events & WL_SOCKET_READABLE)
+			k_ev[0].filter = EVFILT_READ;
+		else
+			k_ev[0].filter = EVFILT_WRITE;
+
+		/*
+		 * We need to create a second kevent object if we need both.  The read
+		 * and write notifications will arrive separately.
+		 */
+		if ((event->events & WL_SOCKET_READABLE) &&
+			(event->events & WL_SOCKET_WRITEABLE))
+		{
+			++count;
+			k_ev[1] = k_ev[0];
+			k_ev[1].filter = EVFILT_WRITE;
+		}
+	}
+
+	rc = kevent(set->kqueue_fd, &k_ev[0], count, NULL, 0, NULL);
+	if (rc < 0)
+		ereport(ERROR,
+				(errcode_for_socket_access(),
+				 errmsg("kevent() failed: %m")));
+}
+
+#endif
+
 #if defined(WAIT_USE_WIN32)
 static void
 WaitEventAdjustWin32(WaitEventSet *set, WaitEvent *event)
@@ -1081,6 +1173,141 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 	return returned_events;
 }
 
+#elif defined(WAIT_USE_KQUEUE)
+
+/*
+ * Wait using FreeBSD kqueue(2)/kevent(2).  Also available on other BSD-family
+ * systems including MacOSX.
+ *
+ * This is the preferable wait method for systems that have it, as several
+ * readiness notifications are delivered, without having to iterate through
+ * all of set->events.
+ */
+static int
+WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
+					  WaitEvent *occurred_events, int nevents)
+{
+	int			returned_events = 0;
+	int			rc;
+	WaitEvent  *cur_event;
+	struct kevent *cur_kqueue_event;
+	struct timespec timeout;
+	struct timespec *timeout_p;
+
+	if (cur_timeout < 0)
+		timeout_p = NULL;
+	else
+	{
+		timeout.tv_sec = cur_timeout / 1000;
+		timeout.tv_nsec = (cur_timeout % 1000) * 1000000;
+		timeout_p = &timeout;
+	}
+
+	/* Sleep */
+	rc = kevent(set->kqueue_fd, NULL, 0,
+				set->kqueue_ret_events, nevents,
+				timeout_p);
+
+	/* Check return code */
+	if (rc < 0)
+	{
+		/* EINTR is okay, otherwise complain */
+		if (errno != EINTR)
+		{
+			waiting = false;
+			ereport(ERROR,
+					(errcode_for_socket_access(),
+					 errmsg("kevent() failed while trying to wait: %m")));
+		}
+		return 0;
+	}
+	else if (rc == 0)
+	{
+		/* timeout exceeded */
+		return -1;
+	}
+
+	/*
+	 * At least one event occurred, iterate over the returned kqueue events
+	 * until they're either all processed, or we've returned all the events
+	 * the caller desired.
+	 */
+	for (cur_kqueue_event = set->kqueue_ret_events;
+		 cur_kqueue_event < (set->kqueue_ret_events + rc) &&
+		 returned_events < nevents;
+		 cur_kqueue_event++)
+	{
+		/* kqueue's udata pointer is set to the associated WaitEvent */
+		cur_event = (WaitEvent *) cur_kqueue_event->udata;
+
+		occurred_events->pos = cur_event->pos;
+		occurred_events->user_data = cur_event->user_data;
+		occurred_events->events = 0;
+
+		if (cur_event->events == WL_LATCH_SET &&
+			cur_kqueue_event->flags & (EV_EOF | EVFILT_READ))
+		{
+			/* There's data in the self-pipe, clear it. */
+			drainSelfPipe();
+
+			if (set->latch->is_set)
+			{
+				occurred_events->fd = PGINVALID_SOCKET;
+				occurred_events->events = WL_LATCH_SET;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+		else if (cur_event->events == WL_POSTMASTER_DEATH &&
+				 cur_kqueue_event->flags & (EVFILT_READ | EV_EOF))
+		{
+			/*
+			 * We expect an EV_EOF when the remote end is closed, but
+			 * because we don't expect the pipe to become readable or to have
+			 * any errors either, treat those cases as postmaster death, too.
+			 *
+			 * As explained in the WAIT_USE_SELECT implementation, select(2)
+			 * may spuriously return. Be paranoid about that here too, a
+			 * spurious WL_POSTMASTER_DEATH would be painful.
+			 */
+			if (!PostmasterIsAlive())
+			{
+				occurred_events->fd = PGINVALID_SOCKET;
+				occurred_events->events = WL_POSTMASTER_DEATH;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+		else if (cur_event->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE))
+		{
+			Assert(cur_event->fd >= 0);
+
+			if ((cur_event->events & WL_SOCKET_READABLE) &&
+				(cur_kqueue_event->flags & (EV_EOF | EVFILT_READ)))
+			{
+				/* readable, or EOF */
+				occurred_events->events |= WL_SOCKET_READABLE;
+			}
+
+			if ((cur_event->events & WL_SOCKET_WRITEABLE) &&
+				(cur_kqueue_event->flags & (EV_EOF | EVFILT_WRITE)))
+			{
+				/* writable, or EOF */
+				occurred_events->events |= WL_SOCKET_WRITEABLE;
+			}
+
+			if (occurred_events->events != 0)
+			{
+				occurred_events->fd = cur_event->fd;
+				occurred_events++;
+				returned_events++;
+			}
+		}
+	}
+
+	return returned_events;
+}
+
 #elif defined(WAIT_USE_POLL)
 
 /*
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c72635c..e319f1d 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -279,6 +279,9 @@
 /* Define to 1 if you have isinf(). */
 #undef HAVE_ISINF
 
+/* Define to 1 if you have the `kqueue' function. */
+#undef HAVE_KQUEUE
+
 /* Define to 1 if you have the <langinfo.h> header file. */
 #undef HAVE_LANGINFO_H
 
@@ -533,6 +536,9 @@
 /* Define to 1 if you have the <sys/epoll.h> header file. */
 #undef HAVE_SYS_EPOLL_H
 
+/* Define to 1 if you have the <sys/event.h> header file. */
+#undef HAVE_SYS_EVENT_H
+
 /* Define to 1 if you have the <sys/ioctl.h> header file. */
 #undef HAVE_SYS_IOCTL_H
 
#90Noah Misch
noah@leadboat.com
In reply to: Васильев Дмитрий (#1)
Re: Performance degradation in commit ac1d794

On Fri, Dec 25, 2015 at 08:08:15PM +0300, Васильев Дмитрий wrote:

I suddenly found commit ac1d794 gives up to 3 times performance degradation.

I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70 CPU-core
machine:
commit ac1d794 gives me 363,474 tps
and previous commit a05dc4d gives me 956,146
and master( 3d0c50f ) with revert ac1d794 gives me 969,265

[This is a generic notification.]

The above-described topic is currently a PostgreSQL 9.6 open item. Robert,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
9.6 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within 72 hours of this
message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping 9.6rc1. Consequently, I will appreciate your
efforts toward speedy resolution. Thanks.

[1]: /messages/by-id/20160527025039.GA447393@tornado.leadboat.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#91Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#90)
Re: Performance degradation in commit ac1d794

On Sun, May 29, 2016 at 1:40 AM, Noah Misch <noah@leadboat.com> wrote:

On Fri, Dec 25, 2015 at 08:08:15PM +0300, Васильев Дмитрий wrote:

I suddenly found commit ac1d794 gives up to 3 times performance degradation.

I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70 CPU-core
machine:
commit ac1d794 gives me 363,474 tps
and previous commit a05dc4d gives me 956,146
and master( 3d0c50f ) with revert ac1d794 gives me 969,265

[This is a generic notification.]

The above-described topic is currently a PostgreSQL 9.6 open item. Robert,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
9.6 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within 72 hours of this
message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping 9.6rc1. Consequently, I will appreciate your
efforts toward speedy resolution. Thanks.

So, the reason this is back on the open items list is that Mithun Cy
re-reported this problem in:

/messages/by-id/CAD__OuhPmc6XH=wYRm_+Q657yQE88DakN4=Ybh2oveFasHkoeA@mail.gmail.com

When I saw that, I moved this from CLOSE_WAIT back to open. However,
subsequently, Ashutosh Sharma posted this, which suggests (not
conclusively) that in fact the problem has been fixed:

/messages/by-id/CAE9k0PkFEhVq-Zg4MH0bZ-zt_oE5PAS6dAuxRCXwX9kEVWceag@mail.gmail.com

What I *think* is going on here is:

- ac1d794 lowered performance
- backend_flush_after with a non-zero default lowered performance with
a vengeance
- 98a64d0 repaired the damage done by ac1d794, or much of it, but
Mithun couldn't see it in his benchmarks because backend_flush_after>0
is so bad

That could be wrong, but I haven't seen any evidence that it's wrong.
So I'm inclined to say we should just move this open item back to the
CLOSE_WAIT list (adding a link to this email to explain why we did
so). Does that work for you?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#92Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#91)
Re: Performance degradation in commit ac1d794

On Tue, May 31, 2016 at 10:09:05PM -0400, Robert Haas wrote:

On Sun, May 29, 2016 at 1:40 AM, Noah Misch <noah@leadboat.com> wrote:

On Fri, Dec 25, 2015 at 08:08:15PM +0300, Васильев Дмитрий wrote:

I suddenly found commit ac1d794 gives up to 3 times performance degradation.

I tried to run pgbench -s 1000 -j 48 -c 48 -S -M prepared on 70 CPU-core
machine:
commit ac1d794 gives me 363,474 tps
and previous commit a05dc4d gives me 956,146
and master( 3d0c50f ) with revert ac1d794 gives me 969,265

[This is a generic notification.]

The above-described topic is currently a PostgreSQL 9.6 open item. Robert,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
9.6 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within 72 hours of this
message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping 9.6rc1. Consequently, I will appreciate your
efforts toward speedy resolution. Thanks.

So, the reason this is back on the open items list is that Mithun Cy
re-reported this problem in:

/messages/by-id/CAD__OuhPmc6XH=wYRm_+Q657yQE88DakN4=Ybh2oveFasHkoeA@mail.gmail.com

When I saw that, I moved this from CLOSE_WAIT back to open. However,
subsequently, Ashutosh Sharma posted this, which suggests (not
conclusively) that in fact the problem has been fixed:

/messages/by-id/CAE9k0PkFEhVq-Zg4MH0bZ-zt_oE5PAS6dAuxRCXwX9kEVWceag@mail.gmail.com

What I *think* is going on here is:

- ac1d794 lowered performance
- backend_flush_after with a non-zero default lowered performance with
a vengeance
- 98a64d0 repaired the damage done by ac1d794, or much of it, but
Mithun couldn't see it in his benchmarks because backend_flush_after>0
is so bad

Ashutosh Sharma's measurements do bolster that conclusion.

That could be wrong, but I haven't seen any evidence that it's wrong.
So I'm inclined to say we should just move this open item back to the
CLOSE_WAIT list (adding a link to this email to explain why we did
so). Does that work for you?

That works for me.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#93Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#92)
Re: Performance degradation in commit ac1d794

Noah Misch <noah@leadboat.com> writes:

On Tue, May 31, 2016 at 10:09:05PM -0400, Robert Haas wrote:

What I *think* is going on here is:
- ac1d794 lowered performance
- backend_flush_after with a non-zero default lowered performance with
a vengeance
- 98a64d0 repaired the damage done by ac1d794, or much of it, but
Mithun couldn't see it in his benchmarks because backend_flush_after>0
is so bad

Ashutosh Sharma's measurements do bolster that conclusion.

That could be wrong, but I haven't seen any evidence that it's wrong.
So I'm inclined to say we should just move this open item back to the
CLOSE_WAIT list (adding a link to this email to explain why we did
so). Does that work for you?

That works for me.

Can we make a note to re-examine this after the backend_flush_after
business is resolved? Or at least get Mithun to redo his benchmarks
with backend_flush_after set to zero?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#94Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#93)
Re: Performance degradation in commit ac1d794

On Tue, May 31, 2016 at 10:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Noah Misch <noah@leadboat.com> writes:

On Tue, May 31, 2016 at 10:09:05PM -0400, Robert Haas wrote:

What I *think* is going on here is:
- ac1d794 lowered performance
- backend_flush_after with a non-zero default lowered performance with
a vengeance
- 98a64d0 repaired the damage done by ac1d794, or much of it, but
Mithun couldn't see it in his benchmarks because backend_flush_after>0
is so bad

Ashutosh Sharma's measurements do bolster that conclusion.

That could be wrong, but I haven't seen any evidence that it's wrong.
So I'm inclined to say we should just move this open item back to the
CLOSE_WAIT list (adding a link to this email to explain why we did
so). Does that work for you?

That works for me.

Can we make a note to re-examine this after the backend_flush_after
business is resolved? Or at least get Mithun to redo his benchmarks
with backend_flush_after set to zero?

Ashutosh Sharma already did pretty much that test in the email to
which I linked.

(Ashutosh Sharma and Mithun CY work in the same office and have access
to the same hardware.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers