Reducing WaitEventSet syscall churn

Started by Thomas Munro · about 6 years ago · 15 messages · pgsql-hackers
#1 Thomas Munro
thomas.munro@gmail.com

Hi,

Here are some patches to get rid of frequent system calls.

0001 changes all qualifying WaitLatch() calls to use a new function
WaitMyLatch() that reuses a common WaitEventSet. That's for callers
that only want to wait for their own latch, with an optional timeout
and automatic exit-on-postmaster-death.
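
For illustration, here is a minimal sketch of the shape I mean, using
the existing latch.h interfaces (the lazy setup and exact body here
are illustrative, not lifted from the patch):

#include "postgres.h"
#include "miscadmin.h"
#include "storage/latch.h"
#include "utils/memutils.h"

static WaitEventSet *MyLatchWaitSet = NULL;

/* Wait for MyLatch or timeout, reusing one long-lived WaitEventSet. */
int
WaitMyLatch(long timeout, uint32 wait_event_info)
{
	WaitEvent	event;

	if (MyLatchWaitSet == NULL)
	{
		/* Created once, so the epoll/kqueue setup syscalls happen once. */
		MyLatchWaitSet = CreateWaitEventSet(TopMemoryContext, 2);
		AddWaitEventToSet(MyLatchWaitSet, WL_LATCH_SET, PGINVALID_SOCKET,
						  MyLatch, NULL);
		AddWaitEventToSet(MyLatchWaitSet, WL_EXIT_ON_PM_DEATH,
						  PGINVALID_SOCKET, NULL, NULL);
	}

	if (WaitEventSetWait(MyLatchWaitSet, timeout, &event, 1,
						 wait_event_info) == 0)
		return WL_TIMEOUT;
	return event.events;
}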

0002 changes condition_variable.c to use WaitMyLatch() instead of its
own local WaitEventSet. Perhaps this makes up for the extra fd
consumed by 0001.

0003 changes pgstat.c to use its own local reusable WaitEventSet.

To see what I'm talking about, try tracing a whole cluster with eg
strace/truss/dtruss -f postgres -D pgdata. This applies to Linux
systems, or BSD/macOS systems with the nearby kqueue patch applied.
On systems that fall back to poll(), there aren't any setup/teardown
syscalls.

Attachments:

0001-Add-WaitMyLatch-to-replace-many-WaitLatch-calls.patch (+58 -66)
0002-Use-WaitMyLatch-for-condition-variables.patch (+1 -24)
0003-Introduce-a-reusable-WaitEventSet-for-the-stats-coll.patch (+14 -11)
#2 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#1)
Re: Reducing WaitEventSet syscall churn

On Tue, Jan 21, 2020 at 1:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Here are some patches to get rid of frequent system calls.

Here is one more case that I was sitting on because I wasn't sure how
to do it: walreceiver.c. To make that work, libpq needs to be able to
tell you when the socket has changed, which I did with a counter that
is exposed to client code in patch 0004. The walreceiver change in
0005 works (trace the system calls on walreceiver to see the
difference), but perhaps we can come up with a better way to code it
so that eg logical/worker.c doesn't finish up duplicating the logic.
Thoughts?

Attachments:

0001-Add-WaitMyLatch-to-replace-many-WaitLatch-calls.patch (+58 -66)
0002-Use-WaitMyLatch-for-condition-variables.patch (+1 -24)
0003-Introduce-a-reusable-WaitEventSet-for-the-stats-coll.patch (+14 -11)
0004-libpq-Add-PQsocketChangeCount-to-advertise-socket-ch.patch (+30 -1)
0005-Reuse-a-WaitEventSet-in-walreceiver.patch (+42 -29)
#3 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#2)
Re: Reducing WaitEventSet syscall churn

On Sat, Feb 8, 2020 at 10:00 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Tue, Jan 21, 2020 at 1:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Here are some patches to get rid of frequent system calls.

Here is one more case that I was sitting on because I wasn't sure how
to do it: walreceiver.c. To make that work, libpq needs to be able to
tell you when the socket has changed, which I did with a counter that
is exposed to client code in patch 0004. The walreceiver change in
0005 works (trace the system calls on walreceiver to see the
difference), but perhaps we can come up with a better way to code it
so that eg logical/worker.c doesn't finish up duplicating the logic.
Thoughts?

(To be clear: I know the 0005 patch doesn't clean up after itself in
various cases, it's for discussion only to see if others have ideas
about how to structure things to suit various potential users of
libpqwalreceiver.so.)

#4 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#3)
Re: Reducing WaitEventSet syscall churn

On Sat, Feb 8, 2020 at 10:15 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Here are some patches to get rid of frequent system calls.

Here's a new version of this patch set. It gets rid of all temporary
WaitEventSets except a couple I mentioned in another thread[1].
WaitLatch() uses CommonWaitSet, and calls to WaitLatchOrSocket() are
replaced by either the existing FeBeWaitSet (walsender, gssapi/openssl
auth are also candidates) or a special purpose long lived WaitEventSet
(replication, postgres_fdw, pgstats). It passes make check-world with
WAIT_USE_POLL, WAIT_USE_KQUEUE, WAIT_USE_EPOLL, all with and without
-DEXEC_BACKEND, and make check with WAIT_USE_WIN32 (Appveyor).

0001: "Don't use EV_CLEAR for kqueue events."

This fixes a problem in the kqueue implementation that only shows up
once you switch to long lived WaitEventSets. It needs to be
level-triggered like the other implementations, for example because
there's a place in the recovery tests where we wait twice in a row
without trying to do I/O in between. (This is a bit like commit
3b790d256f8 that fixed a similar problem on Windows.)
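
To make the flag change concrete in raw kqueue terms (a standalone
illustration, not the patch hunk itself):

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>

/* Register a socket for readability in a long-lived kqueue. */
static void
add_read_event(int kqueue_fd, int sock)
{
	struct kevent ev;

	/*
	 * Before: EV_ADD | EV_CLEAR, i.e. edge-triggered, so a second wait
	 * with no intervening I/O could hang.  After: plain EV_ADD, i.e.
	 * level-triggered, matching the epoll and poll implementations.
	 */
	EV_SET(&ev, sock, EVFILT_READ, EV_ADD, 0, 0, NULL);
	kevent(kqueue_fd, &ev, 1, NULL, 0, NULL);
}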

0002: "Use a long lived WaitEventSet for WaitLatch()."

In the last version, I had a new function WaitMyLatch(), but that
didn't help with recoveryWakeupLatch. Switching between latches
doesn't require a syscall, so I didn't want to have a separate WES and
function just for that. So I went back to using plain old
WaitLatch(), and made it "modify" the latch every time before waiting
on CommonWaitSet. An alternative would be to get rid of the concept
of latches other than MyLatch, and change the function to
WaitMyLatch(). A similar thing happens for exit_on_postmaster_death,
for which I didn't want to have a separate WES, so I just set that
flag every time. Thoughts?
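
Roughly, the wait path becomes the following (names per the patch;
error handling and the WL_TIMEOUT bookkeeping are omitted, so this is
a sketch rather than the patch itself):

/* inside latch.c, which owns struct WaitEventSet */
int
WaitLatch(Latch *latch, int wakeEvents, long timeout,
		  uint32 wait_event_info)
{
	WaitEvent	event;

	/* Re-pointing the latch is pure user-space work: no syscall. */
	ModifyWaitEvent(CommonWaitSet, 0, WL_LATCH_SET, latch);
	CommonWaitSet->exit_on_postmaster_death =
		((wakeEvents & WL_EXIT_ON_PM_DEATH) != 0);

	if (WaitEventSetWait(CommonWaitSet, timeout, &event, 1,
						 wait_event_info) == 0)
		return WL_TIMEOUT;
	return event.events;
}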

0003: "Use regular WaitLatch() for condition variables."

That mechanism doesn't need its own WES anymore.

0004: "Introduce RemoveWaitEvent()."

We'll need that to be able to handle connections that are closed and
reopened under the covers by libpq (replication, postgres_fdw). We
also wanted this on a couple of other threads for multiplexing FDWs,
to be able to adjust the wait set dynamically for a proposed async
Append feature.

The implementation is a little naive, and leaves holes in the
"pollfds" and "handles" arrays (for poll() and win32 implementations).
That could be improved with a bit more footwork in a later patch.
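
In free-list terms, removal looks something like this (next_free is
from the patch; first_free and the kernel-deregistration helper are
names I'm using for illustration only):

void
RemoveWaitEvent(WaitEventSet *set, int pos, bool fd_closed)
{
	/*
	 * If other code already close()d the fd, the kernel has dropped it
	 * from the epoll/kqueue set; otherwise deregister it explicitly.
	 */
	if (!fd_closed)
		WaitEventAdjustRemove(set, &set->events[pos]);	/* illustrative */

	/* Recycle the slot, leaving a hole in the poll()/win32 arrays. */
	set->events[pos].next_free = set->first_free;
	set->first_free = pos;
	set->nevents--;
}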

XXX The Windows version is based on reading documentation. I'd be
very interested to know if check-world passes (especially
contrib/postgres_fdw and src/test/recovery). Unfortunately my
appveyor.yml fu is not yet strong enough.

0005: "libpq: Add PQsocketChangeCount to advertise socket changes."

To support a long lived WES, libpq needs a way to tell us when the
socket changes underneath our feet. This is the simplest thing I could
think of; better ideas welcome.
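
The intended usage pattern is something like this (only the name
PQsocketChangeCount comes from the patch; the signature and the helper
around it are my assumptions):

#include "postgres.h"
#include "libpq-fe.h"
#include "storage/latch.h"

/* Re-register libpq's socket in our wait set if it has been replaced. */
static void
update_socket_event(WaitEventSet *set, PGconn *conn,
					int *sock_pos, int *last_change_count)
{
	int			change_count = PQsocketChangeCount(conn);	/* assumed */

	if (change_count != *last_change_count)
	{
		if (*sock_pos >= 0)
			RemoveWaitEvent(set, *sock_pos, true);	/* old fd is gone */
		*sock_pos = AddWaitEventToSet(set, WL_SOCKET_READABLE,
									  PQsocket(conn), NULL, NULL);
		*last_change_count = change_count;
	}
}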

0006: "Reuse a WaitEventSet in libpqwalreceiver.c."

Rather than having all users of libpqwalreceiver.c deal with the
complicated details of wait set management, have libpqwalreceiver
expose a waiting interface that understands socket changes.

Unfortunately, I couldn't figure out how to use CommonWaitSet for this
(ie adding and removing sockets to that as required), due to
complications with the bookkeeping required to provide the fd_closed
flag to RemoveWaitEvent(). So it creates its own internal long lived
WaitEventSet.

0007: "Use a WaitEventSet for postgres_fdw."

Create a single WaitEventSet and use it for all FDW connections. By
having our own dedicated WES, we can do the bookkeeping required to
know when sockets have been closed or need to be removed from kernel wait
sets explicitly (which would be much harder to achieve if
CommonWaitSet were to be used like that; you need to know when sockets
are closed by other code, so you can provide fd_closed to
RemoveWaitEvent()).

Concretely, if you use just one postgres_fdw connection, you'll see
just epoll_wait()/kevent() calls for waits, but whenever you switch
between different connections, you'll see eg EPOLL_CTL_DEL/EV_DELETE
followed by EPOLL_CTL_ADD/EV_ADD when the set is adjusted (in the kqueue
implementation these could be collapsed into the following wait, but I
haven't done the work for that). An alternative would be to have one
WES per FDW connection, but that seemed wasteful of file descriptors.
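
At the kernel interface, the churn when switching connections is just
this pair (a standalone epoll illustration; kqueue is analogous):

#include <sys/epoll.h>

/* Swap the single monitored FDW socket for another connection's. */
static void
swap_fdw_socket(int epoll_fd, int old_sock, int new_sock)
{
	struct epoll_event ev = {.events = EPOLLIN};

	epoll_ctl(epoll_fd, EPOLL_CTL_DEL, old_sock, NULL);
	ev.data.fd = new_sock;
	epoll_ctl(epoll_fd, EPOLL_CTL_ADD, new_sock, &ev);
}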

0008: "Use WL_EXIT_ON_PM_DEATH in FeBeWaitSet."

The FATAL message you get if you happen to be waiting for IO rather
than waiting somewhere else seems arbitrarily different. By switching
to a standard automatic exit, it opens the possibility of using
FeBeWaitSet in a couple more places that would otherwise need to
create their own WES (see also [1]). Thoughts?
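
Concretely, the change is from reporting the event and hand-crafting
the error to letting latch.c exit for us (a sketch; FeBeWaitSet's real
setup lives in pq_init()):

/* in pqcomm.c */
static void
setup_febe_pm_death_event(void)
{
	/* Before: the event is reported to callers, which ereport(FATAL)
	 * by hand:
	 * AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH,
	 *                   PGINVALID_SOCKET, NULL, NULL); */

	/* After: standard automatic exit inside WaitEventSetWait(). */
	AddWaitEventToSet(FeBeWaitSet, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
					  NULL, NULL);
}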

0009: "Use FeBeWaitSet for walsender.c."

Enabled by 0008.

0010: "Introduce a WaitEventSet for the stats collector."

[1]: /messages/by-id/CA+hUKGK=m9dLrq42oWQ4XfK9iDjGiZVwpQ1HkHrAPfG7Kh681g@mail.gmail.com

Attachments:

0001-Don-t-use-EV_CLEAR-for-kqueue-events.patch (+2 -3)
0002-Use-a-long-lived-WaitEventSet-for-WaitLatch.patch (+61 -16)
0003-Use-WaitLatch-for-condition-variables.patch (+5 -24)
0004-Introduce-RemoveWaitEvent.patch (+120 -15)
0005-libpq-Add-PQsocketChangeCount-to-report-socket-chang.patch (+30 -1)
0006-Reuse-a-WaitEventSet-in-libpqwalreceiver.c.patch (+101 -64)
0007-Use-a-WaitEventSet-for-postgres_fdw.patch (+88 -11)
0008-Use-WL_EXIT_ON_PM_DEATH-in-FeBeWaitSet.patch (+1 -30)
0009-Use-FeBeWaitSet-for-walsender.c.patch (+15 -18)
0010-Introduce-a-WaitEventSet-for-the-stats-collector.patch (+14 -11)
#5 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Thomas Munro (#4)
Re: Reducing WaitEventSet syscall churn

Hello.

I looked at this.

At Thu, 27 Feb 2020 12:17:45 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in

On Sat, Feb 8, 2020 at 10:15 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Here are some patches to get rid of frequent system calls.

Here's a new version of this patch set. It gets rid of all temporary
WaitEventSets except a couple I mentioned in another thread[1].
WaitLatch() uses CommonWaitSet, and calls to WaitLatchOrSocket() are
replaced by either the existing FeBeWaitSet (walsender, gssapi/openssl
auth are also candidates) or a special purpose long lived WaitEventSet
(replication, postgres_fdw, pgstats). It passes make check-world with
WAIT_USE_POLL, WAIT_USE_KQUEUE, WAIT_USE_EPOLL, all with and without
-DEXEC_BACKEND, and make check with WAIT_USE_WIN32 (Appveyor).

0001: "Don't use EV_CLEAR for kqueue events."

This fixes a problem in the kqueue implementation that only shows up
once you switch to long lived WaitEventSets. It needs to be
level-triggered like the other implementations, for example because
there's a place in the recovery tests where we wait twice in a row
without trying to do I/O in between. (This is a bit like commit
3b790d256f8 that fixed a similar problem on Windows.)

It looks fine in light of the kqueue documentation.

0002: "Use a long lived WaitEventSet for WaitLatch()."

In the last version, I had a new function WaitMyLatch(), but that
didn't help with recoveryWakeupLatch. Switching between latches
doesn't require a syscall, so I didn't want to have a separate WES and
function just for that. So I went back to using plain old
WaitLatch(), and made it "modify" the latch every time before waiting
on CommonWaitSet. An alternative would be to get rid of the concept
of latches other than MyLatch, and change the function to
WaitMyLatch(). A similar thing happens for exit_on_postmaster_death,
for which I didn't want to have a separate WES, so I just set that
flag every time. Thoughts?

It is surely an improvement over creating a full-fledged WES every
time. The name CommonWaitSet gives the impression that it is used for
a variety of wait event sets, but it is used only by WaitLatch. So I
would name it LatchWaitSet. With that name I won't be surprised that
the function refers to WL_LATCH_SET by index 0, without any
explanation, when calling ModifyWaitEvent.

@@ -700,7 +739,11 @@ FreeWaitEventSet(WaitEventSet *set)
 	ReleaseExternalFD();
 #elif defined(WAIT_USE_KQUEUE)
 	close(set->kqueue_fd);
-	ReleaseExternalFD();
+	if (set->kqueue_fd >= 0)
+	{
+		close(set->kqueue_fd);
+		ReleaseExternalFD();
+	}

Did you forget to remove the close() outside the if block?
Don't we need the same amendment for epoll_fd as for kqueue_fd?

WaitLatch is documented as "If the latch is already set (and
WL_LATCH_SET is given), the function returns immediately." But now the
function reacts to the latch even if WL_LATCH_SET is not given. I
think it is actually always given, so we need to modify the Assert and
the function comment to follow the change.

0003: "Use regular WaitLatch() for condition variables."

That mechanism doesn't need its own WES anymore.

Looks fine.

0004: "Introduce RemoveWaitEvent()."

We'll need that to be able to handle connections that are closed and
reopened under the covers by libpq (replication, postgres_fdw). We
also wanted this on a couple of other threads for multiplexing FDWs,
to be able to adjust the wait set dynamically for a proposed async
Append feature.

The implementation is a little naive, and leaves holes in the
"pollfds" and "handles" arrays (for poll() and win32 implementations).
That could be improved with a bit more footwork in a later patch.

XXX The Windows version is based on reading documentation. I'd be
very interested to know if check-world passes (especially
contrib/postgres_fdw and src/test/recovery). Unfortunately my
appveyor.yml fu is not yet strong enough.

I didn't find the documentation about INVALID_HANDLE_VALUE in
lpHandles. Could you give me the URL for that?

I didn't run recoverycheck because I couldn't install IPC::Run for
ActivePerl, but contribcheck succeeded.

+	for (int i = 0; i < nevents; ++i)
+		set->handles[i + 1] = INVALID_HANDLE_VALUE;

It accesses set->handles[nevents], which overruns the array.

+	/* Set up the free list. */
+	for (int i = 0; i < nevents; ++i)
+		set->events[i].next_free = i + 1;
+	set->events[nevents - 1].next_free = -1;

It sets the last element twice (harmless but useless).

set->handles = (HANDLE) data;
data += MAXALIGN(sizeof(HANDLE) * nevents);

It is not an issue with this patch, but doesn't handles need
nevents + 1 elements?

WaitEventSetSize does not check its parameter range.

I'll continue reviewing in a later mail.

0005: "libpq: Add PQsocketChangeCount to advertise socket changes."

....

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#6 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Thomas Munro (#4)
Re: Reducing WaitEventSet syscall churn

Hello.

At Tue, 10 Mar 2020 08:19:24 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
me> I'll continue reviewing in a later mail.
me>
me> > 0005: "libpq: Add PQsocketChangeCount to advertise socket changes."
me> ....

At Thu, 27 Feb 2020 12:17:45 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in

On Sat, Feb 8, 2020 at 10:15 AM Thomas Munro <thomas.munro@gmail.com> wrote:
0005: "libpq: Add PQsocketChangeCount to advertise socket changes."

To support a long lived WES, libpq needs a way to tell us when the socket
changes underneath our feet. This is the simplest thing I could think
of; better ideas welcome.

I think Windows, at least, is the reason we can't just detect a
change in the fd value. Instead of counting disconnections, we could
use the libpq event system.

PQregisterEventProc returns false for all of: bad parameter, already
registered, out of memory, and proc rejection. I don't think that is a
usable interface, so the attached 0005 patch fixes it. (I found it is
not strictly needed after making 0007, but I include it as a proposal
separate from this patch set. It does not include the corresponding
doc fix.)

0006: "Reuse a WaitEventSet in libpqwalreceiver.c."

Rather than having all users of libpqwalreceiver.c deal with the
complicated details of wait set management, have libpqwalreceiver
expose a waiting interface that understands socket changes.

Looks reasonable. The attached 0006 and 0007 are a possible
replacement if we use libpq-event.

Unfortunately, I couldn't figure out how to use CommonWaitSet for this
(ie adding and removing sockets to that as required), due to
complications with the bookkeeping required to provide the fd_closed
flag to RemoveWaitEvent(). So it creates its own internal long lived
WaitEventSet.

Agreed, since they are used in different ways. But with the attached
patches, a closed connection is marked with wes_socket_position = -1.

0007: "Use a WaitEventSet for postgres_fdw."

Continues..

The attached are:
0001-0004 Not changed
0005 Fix interface of PQregisterEventProc
0006 Add new libpq event for this use.
0007 Another version of "0006 Reuse a WaitEventSet in
libpqwalreceiver.c" based on libpq event.
0008-0011 Not changed (old 0007-0010, blindly appended)

The patches pass the regression tests (including the TAP recovery
test) up to here.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Don-t-use-EV_CLEAR-for-kqueue-events.patch (+2 -3)
0002-Use-a-long-lived-WaitEventSet-for-WaitLatch.patch (+61 -16)
0003-Use-WaitLatch-for-condition-variables.patch (+5 -24)
0004-Introduce-RemoveWaitEvent.patch (+120 -15)
0005-Fix-interface-of-PQregisterEventProc.patch (+39 -11)
0006-Add-new-libpq-event-PGEVT_CONNDISCONNECTION.patch (+20 -1)
0007-Reuse-a-WaitEventSet-in-libpqwalreceiver.c.patch (+127 -36)
0008-Use-a-WaitEventSet-for-postgres_fdw.patch (+88 -11)
0009-Use-WL_EXIT_ON_PM_DEATH-in-FeBeWaitSet.patch (+1 -30)
0010-Use-FeBeWaitSet-for-walsender.c.patch (+15 -18)
0011-Introduce-a-WaitEventSet-for-the-stats-collector.patch (+14 -11)
#7 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Thomas Munro (#2)
Re: Reducing WaitEventSet syscall churn

At Fri, 13 Mar 2020 16:21:13 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

0007: "Use a WaitEventSet for postgres_fdw."

Continues..

At Thu, 27 Feb 2020 12:17:45 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in

0007: "Use a WaitEventSet for postgres_fdw."

Create a single WaitEventSet and use it for all FDW connections. By
having our own dedicated WES, we can do the bookkeeping required to
know when sockets have been closed or need to be removed from kernel wait
sets explicitly (which would be much harder to achieve if
CommonWaitSet were to be used like that; you need to know when sockets
are closed by other code, so you can provide fd_closed to
RemoveWaitEvent()).

It is straightforward and looks fine. If we use the libpq-event-based
solution instead of using the changecount, pgfdw_wait_for_socket and
disconnect_pg_server would be a bit simpler.

Concretely, if you use just one postgres_fdw connection, you'll see
just epoll_wait()/kevent() calls for waits, but whever you switch
between different connections, you'll see eg EPOLL_CTL_DEL/EV_DELETE
followed by EPOLL_CTL_ADD/EV_ADD when the set is adjusted (in the kqueue
implementation these could be collapsed into the following wait, but I

Such syscall sequences are no longer than what we issue now, so this
patch is still an improvement.

haven't done the work for that). An alternative would be to have one
WES per FDW connection, but that seemed wasteful of file descriptors.

In the multi-connection case, we can save both fds and syscalls by
having one WaitEventSet contain all the connection fds together.
WaitEventSetWait should take an event mask parameter, or
ModifyWaitEvent should set the mask, in that case. The events are now
completely level-triggered, so we can safely ignore unwanted events.

0008: "Use WL_EXIT_ON_PM_DEATH in FeBeWaitSet."

The FATAL message you get if you happen to be waiting for IO rather
than waiting somewhere else seems arbitrarily different. By switching
to a standard automatic exit, it opens the possibility of using
FeBeWaitSet in a couple more places that would otherwise need to
create their own WES (see also [1]). Thoughts?

Do we really not need any message on backend disconnection due to
postmaster death? As for the authentication code, it seems to me the
rationale for not writing to the log is that the connection has not
been established at that point. (That depends on what we think the
"connection" is.)

And I suppose that the reason the authentication logic ignores
signals is that clients expect authentication to complete if at all
possible.

0009: "Use FeBeWaitSet for walsender.c."

Enabled by 0008.

It works and doesn't change behavior. But I found it a bit difficult
to see which events the WaitEventSetWait waits for. Maybe a comment at
the call sites would be sufficient. I think a comment about the bare
number "0" used as the event position in ModifyWaitEvent is also
needed.

0010: "Introduce a WaitEventSet for the stats collector."

This looks fine. The variable event is declared outside its effective
scope, but almost all variables in that function are declared the
same way.

By the way I found that pqcomm.c uses -1 instead of PGINVALID_SOCKET
for AddWaitEventToSet.

[1] /messages/by-id/CA+hUKGK=m9dLrq42oWQ4XfK9iDjGiZVwpQ1HkHrAPfG7Kh681g@mail.gmail.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#8 Daniel Gustafsson
daniel@yesql.se
In reply to: Kyotaro Horiguchi (#6)
Re: Reducing WaitEventSet syscall churn

On 13 Mar 2020, at 08:21, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

The attached are:
0001-0004 Not changed
0005 Fix interface of PQregisterEventProc
0006 Add new libpq event for this use.
0007 Another version of "0006 Reuse a WaitEventSet in
libpqwalreceiver.c" based on libpq event.
0008-0011 Not changed (old 0007-0010, blindly appended)

Since 0001 has already been applied in 9b8aa0929390a, the patchtester
is unable to make heads or tails of this patchset. Can you please
submit a rebased version without the already-applied changes?

cheers ./daniel

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Daniel Gustafsson (#8)
Re: Reducing WaitEventSet syscall churn

Daniel Gustafsson <daniel@yesql.se> writes:

On 13 Mar 2020, at 08:21, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
The attached are:
0001-0004 Not changed
0005 Fix interface of PQregisterEventProc
0006 Add new libpq event for this use.
0007 Another version of "0006 Reuse a WaitEventSet in
libpqwalreceiver.c" based on libpq event.
0008-0011 Not changed (old 0007-0010, blindly appended)

Since 0001 has already been applied in 9b8aa0929390a, the patchtester
is unable to make heads or tails of this patchset. Can you please
submit a rebased version without the already-applied changes?

While I've not really looked at this patchset, I did happen to notice
0005, and I think that's utterly unacceptable as written. You can't
break API/ABI on a released libpq function. I suppose the patch is
assuming that an enum return value is ABI-equivalent to int, but that
assumption is faulty. Moreover, what's the point? None of the later
patches require this AFAICS. (The patch could probably be salvaged
ABI-wise by making the error codes be integer #define's not an enum,
but I fail to see the point of changing this at all. I also don't
much like the idea of allowing callers to assume that there is a fixed
set of possible failure conditions for PQregisterEventProc. If we
wanted to return error details, it'd likely be better to say that an
error message is left in conn->errorMessage.)

0006 is an even more egregious ABI break; you can't renumber existing
enum values without breaking applications. That in itself could be
fixed by putting the new value at the end. But I'd still object very
strongly to 0006, because I do not think that it's acceptable to
have pqDropConnection correspond to an application-visible event.
We use that all over the place for cases that should not be
application-visible, for example when deciding that a connection attempt
has failed and moving on to the next candidate host. We already have
PGEVT_CONNRESET and PGEVT_CONNDESTROY as application-visible connection
state change events, and I don't see why those aren't sufficient.

(BTW, even if these weren't ABI breaks, where are the documentation
changes to go with them?)

regards, tom lane

#10 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#9)
Re: Reducing WaitEventSet syscall churn

On Sun, Jul 12, 2020 at 7:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

[...complaints about 0005 and 0006...] We already have
PGEVT_CONNRESET and PGEVT_CONNDESTROY as application-visible connection
state change events, and I don't see why those aren't sufficient.

I think Horiguchi-san's general idea of using event callbacks for this
sounds much more promising than my earlier idea of exposing a change
counter (that really was terrible). If it can be done with existing
events then that's even better. Perhaps he and/or I can look into
that for the next CF.

In the meantime, here's a rebase of the more straightforward patches
in the stack. These are the ones that deal only with fixed sets of
file descriptors, and they survive check-world on Linux,
Linux+EXEC_BACKEND (with ASLR disabled) and FreeBSD, and at least
check on macOS and Windows (my CI recipes need more work to get
check-world working on those two). There's one user-visible change
that I'd appreciate feedback on: I propose to drop the FATAL error
when the postmaster goes away, to make things more consistent. See
below for more on that.

Responding to earlier review from Horiguchi-san:

On Tue, Mar 10, 2020 at 12:20 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

0002: "Use a long lived WaitEventSet for WaitLatch()."

In the last version, I had a new function WaitMyLatch(), but that
didn't help with recoveryWakeupLatch. Switching between latches
doesn't require a syscall, so I didn't want to have a separate WES and
function just for that. So I went back to using plain old
WaitLatch(), and made it "modify" the latch every time before waiting
on CommonWaitSet. An alternative would be to get rid of the concept
of latches other than MyLatch, and change the function to
WaitMyLatch(). A similar thing happens for exit_on_postmaster_death,
for which I didn't want to have a separate WES, so I just set that
flag every time. Thoughts?

It is surely an improvement over creating a full-fledged WES every
time. The name CommonWaitSet gives the impression that it is used for
a variety of wait event sets, but it is used only by WaitLatch. So I
would name it LatchWaitSet. With that name I won't be surprised that
the function refers to WL_LATCH_SET by index 0, without any
explanation, when calling ModifyWaitEvent.

Ok, I changed it to LatchWaitSet. I also replaced the index 0 with a
symbolic name LatchWaitSetLatchPos, to make that clearer.

@@ -700,7 +739,11 @@ FreeWaitEventSet(WaitEventSet *set)
ReleaseExternalFD();
#elif defined(WAIT_USE_KQUEUE)
close(set->kqueue_fd);
-       ReleaseExternalFD();
+       if (set->kqueue_fd >= 0)
+       {
+               close(set->kqueue_fd);
+               ReleaseExternalFD();
+       }

Did you forget to remove the close() outside the if block?
Don't we need the same amendment for epoll_fd as for kqueue_fd?

Hmm, maybe I screwed that up when resolving a conflict with the
ReleaseExternalFD() stuff. Fixed.

WaitLatch is documented as "If the latch is already set (and
WL_LATCH_SET is given), the function returns immediately." But now the
function reacts to the latch even if WL_LATCH_SET is not given. I
think it is actually always given, so we need to modify the Assert and
the function comment to follow the change.

It seems a bit silly to call WaitLatch() if you don't want to wait for
a latch, but I think we can keep that comment and logic by assigning
set->latch = NULL when you wait without WL_LATCH_SET. I tried that in
the attached.
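
In code form the idea is simply this (a sketch; the helper name is
mine, not the patch's):

/* Only watch a latch when the caller actually asked for one. */
static void
set_latch_for_wait(WaitEventSet *set, int latch_pos, int wakeEvents)
{
	ModifyWaitEvent(set, latch_pos, WL_LATCH_SET,
					(wakeEvents & WL_LATCH_SET) ? MyLatch : NULL);
}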

0004: "Introduce RemoveWaitEvent()."

We'll need that to be able to handle connections that are closed and
reopened under the covers by libpq (replication, postgres_fdw). We
also wanted this on a couple of other threads for multiplexing FDWs,
to be able to adjust the wait set dynamically for a proposed async
Append feature.

The implementation is a little naive, and leaves holes in the
"pollfds" and "handles" arrays (for poll() and win32 implementations).
That could be improved with a bit more footwork in a later patch.

XXX The Windows version is based on reading documentation. I'd be
very interested to know if check-world passes (especially
contrib/postgres_fdw and src/test/recovery). Unfortunately my
appveyor.yml fu is not yet strong enough.

I didn't find the documentation about INVALID_HANDLE_VALUE in
lpHandles. Could you give me the URL for that?

I was looking for how you do the equivalent of Unix file descriptor -1
in a call to poll(), and somewhere I read that INVALID_HANDLE_VALUE
has the right effect. I can't find that reference now. Apparently it
works because that's the pseudo-handle value -1 that is returned by
GetCurrentProcess(), and waiting for your own process waits forever so
it's a suitable value for holes in an array of event handles. We
should probably call GetCurrentProcess() instead, but really that is
just stand-in code: we should rewrite it so that we don't need holes!
That might require a second array for use by the poll and win32
implementations. Let's return to that in a later CF?

On Mon, Mar 30, 2020 at 6:15 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

0008: "Use WL_EXIT_ON_PM_DEATH in FeBeWaitSet."

The FATAL message you get if you happen to be waiting for IO rather
than waiting somewhere else seems arbitrarily different. By switching
to a standard automatic exit, it opens the possibility of using
FeBeWaitSet in a couple more places that would otherwise need to
create their own WES (see also [1]). Thoughts?

Do we really not need any message on backend disconnection due to
postmaster death? As for the authentication code, it seems to me the
rationale for not writing to the log is that the connection has not
been established at that point. (That depends on what we think the
"connection" is.)

To me, it looks like this variation is just from a time when
postmaster death handling was less standardised. The comment (removed
by this patch) even introduces the topic of postmaster exit as if
you've never heard of it, in this arbitrary location in the tree, one
wait point among many (admittedly one that is often reached). I don't
think there is any good reason for random timing to affect whether or
not you get a hand crafted FATAL message.

However, if others don't like this change, we could drop this patch
and still use FeBeWaitSet for walsender.c. It'd just need to handle
postmaster death explicitly. It felt weird to be adding new code that
has to handle postmaster death explicitly, which is what led me to
notice that the existing FeBeWaitSet coding (and resulting user
experience) is different from other code.

0009: "Use FeBeWaitSet for walsender.c."

Enabled by 0008.

It works and doesn't change behavior. But I found it a bit difficult
to see which events the WaitEventSetWait waits for. Maybe a comment at
the call sites would be sufficient. I think a comment about the bare
number "0" used as the event position in ModifyWaitEvent is also
needed.

Agreed. I changed it to use symbolic names
FeBeWaitSet{Socket,Latch}Pos in the new code and also in the
pre-existing code like this.
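
For example (the values here are assumed from the order the events are
added when the set is created, socket first, then latch):

#define FeBeWaitSetSocketPos	0
#define FeBeWaitSetLatchPos		1

/* Call sites then read like this instead of using a bare 0 or 1: */
static void
point_febe_set_at(Latch *latch)
{
	ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetLatchPos, WL_LATCH_SET, latch);
}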

By the way I found that pqcomm.c uses -1 instead of PGINVALID_SOCKET
for AddWaitEventToSet.

Oh yeah, that's a pre-existing problem. Fixed, since that code was
changed by the patch anyway.

Attachments:

v5-0002-Use-WaitLatch-for-condition-variables.patch (+5 -24)
v5-0003-Introduce-a-WaitEventSet-for-the-stats-collector.patch (+14 -11)
v5-0004-Use-WL_EXIT_ON_PM_DEATH-in-FeBeWaitSet.patch (+1 -30)
v5-0001-Use-a-long-lived-WaitEventSet-for-WaitLatch.patch (+66 -17)
v5-0005-Introduce-symbolic-names-for-FeBeWaitSet-position.patch (+22 -8)
v5-0006-Use-FeBeWaitSet-for-walsender.c.patch (+15 -18)
#11 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#10)
Re: Reducing WaitEventSet syscall churn

On Tue, Jul 14, 2020 at 6:51 PM Thomas Munro <thomas.munro@gmail.com> wrote:

In the meantime, here's a rebase of the more straightforward patches
in the stack. These are the ones that deal only with fixed sets of
file descriptors, and they survive check-world on Linux,
Linux+EXEC_BACKEND (with ASLR disabled) and FreeBSD, and at least
check on macOS and Windows (my CI recipes need more work to get
check-world working on those two). There's one user-visible change
that I'd appreciate feedback on: I propose to drop the FATAL error
when the postmaster goes away, to make things more consistent. See
below for more on that.

Here's the effect of patches 0001-0003 on the number of relevant
system calls generated by "make check" on Linux and FreeBSD, according
to strace/truss -f -c:

epoll_create1: 4,825 -> 865
epoll_ctl: 12,454 -> 2,721
epoll_wait: ~45k -> ~45k
close: ~81k -> ~77k

kqueue: 4,618 -> 866
kevent: ~54k -> ~46k
close: ~65k -> ~61k

I pushed those three patches, but will wait for more discussion on the rest.

#12 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#11)
Re: Reducing WaitEventSet syscall churn

On Thu, Jul 30, 2020 at 5:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I pushed those three patches, but will wait for more discussion on the rest.

And here's a rebase, to make cfbot happy.

Attachments:

v6-0001-Use-WL_EXIT_ON_PM_DEATH-in-FeBeWaitSet.patch (+1 -30)
v6-0002-Introduce-symbolic-names-for-FeBeWaitSet-position.patch (+22 -8)
v6-0003-Use-FeBeWaitSet-for-walsender.c.patch (+15 -18)
#13 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#12)
Re: Reducing WaitEventSet syscall churn

On Fri, Jul 31, 2020 at 9:42 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Jul 30, 2020 at 5:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I pushed those three patches, but will wait for more discussion on the rest.

And here's a rebase, to make cfbot happy.

And again.

To restate the two goals of the remaining patches:

1. Use FeBeWaitSet for walsender, instead of setting up and tearing
down temporary WaitEventSets all the time.
2. Use the standard WL_EXIT_ON_PM_DEATH flag for FeBeWaitSet, instead
of the special case ereport(FATAL, ... "terminating connection due to
unexpected postmaster exit" ...).

For point 2, the question I am raising is: why should users get a
special FATAL message in some places and not others, for PM death?
However, if people are attached to that behaviour, we could still
achieve goal 1 without goal 2 by handling PM death explicitly in
walsender.c and I'd be happy to post an alternative patch set like
that.

Attachments:

v7-0001-Use-WL_EXIT_ON_PM_DEATH-in-FeBeWaitSet.patch (+1 -30)
v7-0002-Introduce-symbolic-names-for-FeBeWaitSet-position.patch (+22 -8)
v7-0003-Use-FeBeWaitSet-for-walsender.c.patch (+17 -19)
#14 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#13)
Re: Reducing WaitEventSet syscall churn

On Tue, Jan 5, 2021 at 6:10 PM Thomas Munro <thomas.munro@gmail.com> wrote:

For point 2, the question I am raising is: why should users get a
special FATAL message in some places and not others, for PM death?
However, if people are attached to that behaviour, we could still
achieve goal 1 without goal 2 by handling PM death explicitly in
walsender.c and I'd be happy to post an alternative patch set like
that.

Here's the alternative patch set, with no change to existing error
message behaviour. I'm going to commit this version and close this CF
item soon if there are no objections.

Attachments:

v8-0001-Introduce-symbolic-names-for-FeBeWaitSet-position.patch (+24 -8)
v8-0002-Use-FeBeWaitSet-for-walsender.c.patch (+25 -19)
#15 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#14)
Re: Reducing WaitEventSet syscall churn

On Sat, Feb 27, 2021 at 2:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Here's the alternative patch set, with no change to existing error
message behaviour. I'm going to commit this version and close this CF
item soon if there are no objections.

Pushed.

That leaves just walreceiver and postgres_fdw in need of WaitEventSet
improvements along these lines. The raw ingredients for that were
present in earlier patches, and I think Horiguchi-san had the right
idea: we should use the libpq event system to adjust our WES as
appropriate when it changes the socket underneath us. I will leave
this CF entry open a bit longer in case he would like to post an
updated version of that part (considering Tom's feedback[1]). If not,
we can close this and try again in the next cycle.

[1]: /messages/by-id/2446176.1594495351@sss.pgh.pa.us