[PATCH] OAuth: fix performance bug with stuck multiplexer events

Started by Jacob Champion · 9 months ago · 24 messages
#1 Jacob Champion
jacob.champion@enterprisedb.com

Hi all,

The current implementation of the OAuth flow is overly eager to signal
readiness when there's nothing to do, so sometimes we burn CPU for no
reason. This was discussed briefly back in [1], but I'll summarize
here so no one has to dig through that megathread.

= Background =

A major interaction between libpq and libcurl happens through a
callback -- register_socket() in oauth-curl.c -- which links the
sockets that Curl cares about into a single descriptor that our
clients can wait on. This descriptor, which is either an epoll set or
a kqueue, is called the multiplexer.

Curl has wide latitude to make this interaction as simple or as
complicated as it wants, and the complexity additionally depends on
the transport in use. For a simple HTTP request/response with older
Curls, the callback order might look like

- wait for reads on fd 5
- done with fd 5

But for a modern dual-stack system speaking HTTPS, you might get something like

- wait for writes on fd 6, time out after 200ms
- wait for writes on fd 7
- done with fd 6, cancel the timeout
- wait for reads only on fd 7, time out after five minutes
- wait for both reads and writes on fd 7
...
- done with fd 7, cancel the timeout

= Getting Stuck =

The Linux/epoll implementation handles the latter case well (with one
caveat, discussed later), but with the BSD/kqueue implementation, the
multiplexer can get "stuck open" on stale events that Curl no longer
cares about, and clients of libpq can loop on PQconnectPoll()
pointlessly.

The flow still works -- there's no bug in functionality, just in
performance -- but that also makes it hard to detect in tests. And for
simpler network cases (the ones in our tests), the bug is nearly
invisible, because Curl will unregister descriptors and clear the
stuck events before they have any real effect.

0001, attached, fixes this by counting the number of outstanding event
registrations and draining up to that many events from the kqueue
after libcurl gives control back to us. This ensures that any stale
events will be knocked back down. (This design was originally
suggested by Thomas Munro in a related conversation about kqueue
corner cases; thanks Thomas!)

If you follow the flow from start to finish, you may notice the
following oddity: I took great pains not to track which events had
been set, instead telling libcurl "something happened, figure it out"
-- and now in this patchset I'm pulling all of those events off in a
big wave only to discard them. As punishment for trying to keep the
epoll side of things simple, I'm now having to add unnecessary
complexity to the kqueue side. Seems like I'll eventually need to
balance the two out again.

= Timeouts =

There is a similar "stale event" bug for the timer, with both epoll
and kqueue: if Curl does not disable a timeout explicitly after it
fires, that timer's descriptor will simply remain signaled until
something else happens to set it.

Unfortunately, we can't solve this bug in the same place as 0001. If
we clear stale timers after calling into Curl, that would introduce a
race: Curl may have set a short timeout that has already passed, and
we need that event to remain signaled so that the client will call us
back.

So expired timers have to be cleared _before_ calling back into Curl,
which I have done in 0002. As part of that, I've strengthened the
timer_expired() API so that it always returns false when the timer is
disabled. It was previously a don't-care, since we only called it when
we knew the timer must be running, so the kqueue and epoll
implementations used to give different answers.

= Testing =

It was hard to notice any of this with the current tests, because
they're mostly focused on ensuring that the flow works, not on
internal efficiency. I tried to find a way to internally detect when a
file descriptor had gone stale by accident, but I think it's hard to
do that without false positives if we don't know what Curl is doing.

In the end, I decided to test a heuristic: we expect roughly 5-15
calls for these simple localhost flows, so if we see something more
than an order of magnitude above that, something is wrong. (When "bad
behavior" happens in the CI, I see hundreds or thousands of calls.)
This is implemented in 0003 by simply counting the number of calls and
printing them at the end of the flow in debug mode.

I strengthened a bunch of postconditions in the first two patches, so
to try to prevent decay (and keep other maintainers from having to
remember all this), I've added a unit test executable that pins many
internal expectations. It's essentially a version of the libpq-oauth
plugin which has been given a main() function and TAP macros. This is
done in 0004.

I feel confident that these new tests are useful, because:

- The new unit tests caught that my initial register_socket attempt,
written on macOS, did not work on the other BSDs. (EV_RECEIPT works
differently there.)
- If I revert the changes to register_socket, the new unit tests fail
on all BSDs in the CI, and the call-count test fails on NetBSD and
OpenBSD.
- If I comment out the new call to drain_timer_events(), the
call-count test fails on macOS.
- If I revert the changes to timer_expired, the new unit tests fail on
Linux. (There are no user-visible effects of that change, because it
doesn't lead to more calls.)

Unfortunately, if I just comment out the new call to
drain_socket_events(), nothing fails in the CI. :( But my personal
test suite (which makes use of TLS plus dual-stack sockets) does
notice the regression in the call count immediately, with some tests
using tens of thousands of calls for a single client flow without the
fix. I hope that's good enough for now, because I'm not sure if I can
safely push the additional TLS infrastructure into the oauth_validator
tests for 18.

My plan, if this code seems reasonable, is to backport 0001-0003, but
keep the larger 0004 on HEAD only until it has proven to be stable.
It's a big new suite and I want to make sure it's not flapping on some
buildfarm animal. Eventually I'll backport that too.

WDYT?

Thanks,
--Jacob

[1]: /messages/by-id/CAOYmi+n4EDOOUL27_OqYT2-F2rS6S+3mK-ppWb2Ec92UEoUbYA@mail.gmail.com

Attachments:

0001-oauth-Remove-stale-events-from-the-kqueue-multiplexe.patch (+166/-53)
0002-oauth-Remove-expired-timers-from-the-multiplexer.patch (+68/-41)
0003-oauth-Track-total-call-count-during-a-client-flow.patch (+52/-2)
0004-oauth-Add-unit-tests-for-multiplexer-handling.patch (+547/-1)
#2 Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jacob Champion (#1)
Re: [PATCH] OAuth: fix performance bug with stuck multiplexer events

Hi all,

On Thu, Jun 26, 2025 at 4:33 PM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:

My plan, if this code seems reasonable, is to backport 0001-0003, but
keep the larger 0004 on HEAD only until it has proven to be stable.
It's a big new suite and I want to make sure it's not flapping on some
buildfarm animal. Eventually I'll backport that too.

Any thoughts on the approach? Too big/too scary/too BSD-specific?

A small bit of self-review: a comment I wrote in the tests suggested
that the choice of readable/writable events was up to the multiplexer
implementation, but it *must* choose readable, due to the hardcoded
use of PGRES_POLLING_READING throughout the current code. Updated in
v2.

Thanks,
--Jacob

Attachments:

since-v1.diff.txt
v2-0001-oauth-Remove-stale-events-from-the-kqueue-multipl.patch (+166/-53)
v2-0002-oauth-Remove-expired-timers-from-the-multiplexer.patch (+68/-41)
v2-0003-oauth-Track-total-call-count-during-a-client-flow.patch (+52/-2)
v2-0004-oauth-Add-unit-tests-for-multiplexer-handling.patch (+547/-1)
#3 Thomas Munro
thomas.munro@gmail.com
In reply to: Jacob Champion (#2)
Re: [PATCH] OAuth: fix performance bug with stuck multiplexer events

On Tue, Jul 29, 2025 at 8:52 AM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:

On Thu, Jun 26, 2025 at 4:33 PM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:

My plan, if this code seems reasonable, is to backport 0001-0003, but
keep the larger 0004 on HEAD only until it has proven to be stable.
It's a big new suite and I want to make sure it's not flapping on some
buildfarm animal. Eventually I'll backport that too.

Any thoughts on the approach? Too big/too scary/too BSD-specific?

A small bit of self-review: a comment I wrote in the tests suggested
that the choice of readable/writable events was up to the multiplexer
implementation, but it *must* choose readable, due to the hardcoded
use of PGRES_POLLING_READING throughout the current code. Updated in
v2.

[FYI, I'm looking into this and planning to post a review in 1-2 days...]

#4 Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Thomas Munro (#3)
Re: [PATCH] OAuth: fix performance bug with stuck multiplexer events

On Mon, Aug 4, 2025 at 7:53 AM Thomas Munro <thomas.munro@gmail.com> wrote:

[FYI, I'm looking into this and planning to post a review in 1-2 days...]

Thanks so much!

--Jacob

#5 Thomas Munro
thomas.munro@gmail.com
In reply to: Jacob Champion (#4)
Re: [PATCH] OAuth: fix performance bug with stuck multiplexer events

On Tue, Aug 5, 2025 at 3:24 AM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:

On Mon, Aug 4, 2025 at 7:53 AM Thomas Munro <thomas.munro@gmail.com> wrote:

[FYI, I'm looking into this and planning to post a review in 1-2 days...]

0001:

So, the problem is that poll(kqueue_fd) reports POLLIN if any events
are queued, but level-triggered events are only rechecked and possibly
cancelled if you actually call kevent(). Hmm, what if you just called
kevent() with an output array of size one?

* If it returns 0, that means it has rechecked all queued
level-triggered events and booted them all out because they are no
longer true. poll(fd) won't report POLLIN until one of them is queued
again.

* If it returns 1, then it stopped on the first level-triggered event
that it rechecked and found to be still true. Who cares if there are
more that didn't get rechecked? poll(fd) will report POLLIN either
way, and that's what you want.

#6 Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Thomas Munro (#5)
Re: [PATCH] OAuth: fix performance bug with stuck multiplexer events

On Wed, Aug 6, 2025 at 8:04 AM Thomas Munro <thomas.munro@gmail.com> wrote:

* If it returns 1, then it stopped on the first level-triggered event
that it rechecked and found to be still true. Who cares if there are
more that didn't get rechecked? poll(fd) will report POLLIN either
way, and that's what you want.

I think the weaker guarantee might be sufficient. I was trying to get
a stronger primitive in place so that we wouldn't have to worry about
it down the line, but it is a lot of code to pay...

One sharp edge of that strategy is caught by the new tests, which is
that if you call drain_socket_events() and then unset the timer, your
multiplexer is still readable until you call drain_socket_events() yet
again. At the moment, our code only ever calls those two in the
opposite order (due to the race condition pointed out in 0002); we'd
just have to keep that in mind. Maybe "drain" would no longer be the
verb to use there.

--Jacob

#7 Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jacob Champion (#6)
Re: [PATCH] OAuth: fix performance bug with stuck multiplexer events

On Wed, Aug 6, 2025 at 9:13 AM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:

Maybe "drain" would no longer be the
verb to use there.

I keep describing this as "combing" the queue when I talk about it in
person, so v3-0001 renames this new operation to comb_multiplexer().
And the CI (plus the more strenuous TLS tests) confirms that the
callback count is still stable with this weaker guarantee, so I've
gotten rid of the event-counting code.

Now that I'm no longer counting events, I can collapse the changes to
register_socket(). I can't revert those changes entirely, because then
we regress the case where Curl switches a socket from IN to OUT (this
is enforced by the new unit tests). But I'm not sure that the existing
comment adequately explained that fix anyway, and I didn't remember to
call it out in my initial email, so I've split it out into v3-0002.
It's much smaller.

The tests (now in 0005) have been adjusted for the new "combing"
behavior, and I've added a case to ensure that multiple stale events
are swept up by a single call to comb_multiplexer().

Thanks!
--Jacob

Attachments:

since-v2.diff.txt (+0/-2)
v3-0001-oauth-Remove-stale-events-from-the-kqueue-multipl.patch (+68/-7)
v3-0002-oauth-Ensure-unused-socket-registrations-are-remo.patch (+14/-9)
v3-0003-oauth-Remove-expired-timers-from-the-multiplexer.patch (+68/-41)
v3-0004-oauth-Track-total-call-count-during-a-client-flow.patch (+52/-2)
v3-0005-oauth-Add-unit-tests-for-multiplexer-handling.patch (+600/-1)
#8 Thomas Munro
thomas.munro@gmail.com
In reply to: Jacob Champion (#7)
Re: [PATCH] OAuth: fix performance bug with stuck multiplexer events

On Thu, Aug 7, 2025 at 11:55 AM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:

On Wed, Aug 6, 2025 at 9:13 AM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:

Maybe "drain" would no longer be the
verb to use there.

I keep describing this as "combing" the queue when I talk about it in
person, so v3-0001 renames this new operation to comb_multiplexer().
And the CI (plus the more strenuous TLS tests) confirms that the
callback count is still stable with this weaker guarantee, so I've
gotten rid of the event-counting code.

I was about to hit send on an email suggesting "reset_multiplexer()",
and an attempt at distilling the explanation to a short paragraph, but
your names and explanations are also good so please feel 100% free to
ignore these suggestions.

"Unlike epoll descriptors, kqueue descriptors only transition from
readable to unreadable when kevent() is called and finds nothing,
after removing level-triggered conditions that have gone away. We
therefore need a dummy kevent() call after operations might have been
performed on the monitored sockets or timer_fd. Any event returned is
ignored here, but it also remains queued (being level-triggered) and
leaves the descriptor readable. This is a no-op for epoll
descriptors."

Reviewing the timer stuff, it is again tempting to try to use an
EVFILT_TIMER directly, but I like your approach: it's nice to be able
to mirror the Linux coding, with this minor kink ironed out.

FWIW I re-read the kqueue paper's discussion of the goals of making
kqueue descriptors themselves monitorable/pollable, and it seems it
was mainly intended for hierarchies of kqueues, like your timer_fd,
with the specific aim of expressing priorities. It doesn't talk about
giving them to code that doesn't know it has a kqueue fd (the client)
and never calls kevent() and infers the events instead (libcurl).
That said, the fact that the filter function for kqueue fds just
checks queue size > 0 without running the filter functions for the
queued events does seem like a bit of an abstraction leak from this
vantage point. At least it's easy enough to work around in the
kqueue-managing middleman code once you understand it.

Now that I'm no longer counting events, I can collapse the changes to
register_socket(). I can't revert those changes entirely, because then
we regress the case where Curl switches a socket from IN to OUT (this
is enforced by the new unit tests). But I'm not sure that the existing
comment adequately explained that fix anyway, and I didn't remember to
call it out in my initial email, so I've split it out into v3-0002.
It's much smaller.

Much nicer! Yeah, that all makes sense.

The tests (now in 0005) have been adjusted for the new "combing"
behavior, and I've added a case to ensure that multiple stale events
are swept up by a single call to comb_multiplexer().

This all looks pretty good to me. I like the C TAP test. PostgreSQL
needs more of this.

s/signalled/signaled/ (= US spelling) in a couple of places.

#9 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#8)
Re: [PATCH] OAuth: fix performance bug with stuck multiplexer events

On Thu, Aug 7, 2025 at 1:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I like the C TAP test. PostgreSQL
needs more of this.

I should add, I didn't look closely at that part since you said it's
not in scope for back-patching. I'd like to, though, later.

I wonder if you would be interested in this attempt at centralised
infrastructure for unit testing our C code over here. I'm not
suggesting it for your immediate problem, just noting the overlap:

/messages/by-id/CA+hUKG+ajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN+icg@mail.gmail.com

Basically I would like to be able to dump easy-to-write files into the
tree that say stuff like this ↓ and have the build scripts find them,
build them and test them without all the module boilerplate stuff or a
running server (though that aspect is obviously not relevant for your
frontend case). Like you find in googletest or various xunit-style
systems in other projects, but integrated with our TAP universe (or
whatever replaces it if we escape from Perl). But I never got the
configure part of it working.

PG_BEGIN_TESTS();
...
PG_EXPECT_EQ(pg_preadv(fd, iov, 2, 11), 0);
PG_EXPECT_EQ(pg_pread(fd, buffer, 10, 0), 10);
PG_EXPECT_EQ_STR(buffer, "helloworld");
...
PG_END_TESTS();

#10 Dagfinn Ilmari Mannsåker
ilmari@ilmari.org
In reply to: Jacob Champion (#7)
Re: [PATCH] OAuth: fix performance bug with stuck multiplexer events

Jacob Champion <jacob.champion@enterprisedb.com> writes:

From 50257bf32eb2b0972e5139ac4a79367372c77385 Mon Sep 17 00:00:00 2001
From: Jacob Champion <jacob.champion@enterprisedb.com>
Date: Wed, 5 Mar 2025 15:04:34 -0800
Subject: [PATCH v3 5/5] oauth: Add unit tests for multiplexer handling

I haven't read the meat of the patch, but I have some comments on the
tests:

+IPC::Run::run ['oauth_tests'],
+  '>', IPC::Run::new_chunker, sub { print {$out} $_[0] },
+  '2>', IPC::Run::new_chunker, sub { print {$err} $_[0] }
+  or die "oauth_tests returned $?";

We've recently switched to using fat commas (=>) between options and
their arguments, and that includes the file redirections in IPC::Run.
Although not semantically meaningful, I'd also be tempted to put parens
around the argument list for each redirect, so it's clear that they go
together.

Also, indirect object syntax (print {$fh} ...) is ugly and
old-fashioned, it's nicer to call it as a method on the filehandle.

So I'd write the above as:

IPC::Run::run ['oauth_tests'],
'>' => (IPC::Run::new_chunker, sub { $out->print($_[0]) }),
'2>' => (IPC::Run::new_chunker, sub { $err->print($_[0]) })
or die "oauth_tests returned $?";

As for the C TAP tests, there's already a bunch of TAP-outputting
infrastructure in pg_regress.c. Would it make sense to factor that out
into a common library?

- ilmari

#11 Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Thomas Munro (#8)
Re: [PATCH] OAuth: fix performance bug with stuck multiplexer events

On Wed, Aug 6, 2025 at 6:46 PM Thomas Munro <thomas.munro@gmail.com> wrote:

"Unlike epoll descriptors, kqueue descriptors only transition from
readable to unreadable when kevent() is called and finds nothing,
after removing level-triggered conditions that have gone away. We
therefore need a dummy kevent() call after operations might have been
performed on the monitored sockets or timer_fd. Any event returned is
ignored here, but it also remains queued (being level-triggered) and
leaves the descriptor readable. This is a no-op for epoll
descriptors."

I really like this; I'm working it into the doc comment.

FWIW I re-read the kqueue paper's discussion of the goals of making
kqueue descriptors themselves monitorable/pollable, and it seems it
was mainly intended for hierarchies of kqueues, like your timer_fd,
with the specific aim of expressing priorities. It doesn't talk about
giving them to code that doesn't know it has a kqueue fd (the client)
and never calls kevent() and infers the events instead (libcurl).

Interesting! It would be nice if they papered over this for us, but I
guess that's water under the bridge.

s/signalled/signaled/ (= US spelling) in a couple of places.

Ah. Will fix(?) or else lobby the dictionary companies.

Thank you so much for the reviews!

--Jacob

#12 Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Dagfinn Ilmari Mannsåker (#10)
Re: [PATCH] OAuth: fix performance bug with stuck multiplexer events

On Thu, Aug 7, 2025 at 9:35 AM Dagfinn Ilmari Mannsåker
<ilmari@ilmari.org> wrote:

I haven't read the meat of the patch, but I have some comments on the
tests:

Thanks for the review!

+IPC::Run::run ['oauth_tests'],
+  '>', IPC::Run::new_chunker, sub { print {$out} $_[0] },
+  '2>', IPC::Run::new_chunker, sub { print {$err} $_[0] }
+  or die "oauth_tests returned $?";

We've recently switched to using fat commas (=>) between options and
their arguments, and that includes the file redirections in IPC::Run.
Although not semantically meaningful, I'd also be tempted to put parens
around the argument list for each redirect, so it's clear that they go
together.

I have two concerns:
- If I don't put parentheses around the list, the fat comma is
actively misleading.
- As far as I can tell, IPC::Run neither documents nor tests the
ability to pass a list here. (But the tests are a bit of a maze, so
please correct me if there is one.) My fear is that I'll be coupling
against an implementation detail if I write it that way.

So I'm leaning towards keeping it as-is, unless you know of a reason
that the list syntax is guaranteed to work, with the understanding
that it does diverge from what you authored in 19c6e92b1. But I don't
think any of those examples use filters, so I don't feel too bad about
the difference yet?

Also, indirect object syntax (print {$fh} ...) is ugly and
old-fashioned, it's nicer to call it as a method on the filehandle.

That is much nicer; I'll do that.

As for the C TAP tests, there's already a bunch of TAP-outputting
infrastructure in pg_regress.c. Would it make sense to factor that out
into a common library?

Maybe if we got to rule-of-three, but I'd rather not make either
implementation compromise for the sake of the other. IMO, this is a
situation where a bad abstraction would be much costlier than the
duplication: TAP is lightweight, and I think the needs of a unit test
suite and the needs of a characterization test collector are very
different.

--Jacob

#13 Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Thomas Munro (#9)
Re: [PATCH] OAuth: fix performance bug with stuck multiplexer events

On Wed, Aug 6, 2025 at 7:43 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Aug 7, 2025 at 1:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:
I wonder if you would be interested in this attempt at centralised
infrastructure for unit testing our C code over here. I'm not
suggesting it for your immediate problem, just noting the overlap:

/messages/by-id/CA+hUKG+ajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN+icg@mail.gmail.com

I am all for better testability at the function level, and it even
looks like 0002 solved a problem that stopped me from getting rid of
the Perl shim...

The "test without a running server" part also might have some overlap
with my fuzzing work. So I'll take a closer look later; thanks!

--Jacob

#14 Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jacob Champion (#11)
Re: [PATCH] OAuth: fix performance bug with stuck multiplexer events

On Thu, Aug 7, 2025 at 11:11 AM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:

Thank you so much for the reviews!

Here is v4, with the feedback from both of you. 0001-0004 are planned
for backport; 0005 is slated for master only. Thanks again for the
reviews!

--Jacob

Attachments:

since-v3.diff.txt
v4-0001-oauth-Remove-stale-events-from-the-kqueue-multipl.patch (+61/-7)
v4-0002-oauth-Ensure-unused-socket-registrations-are-remo.patch (+14/-9)
v4-0003-oauth-Remove-expired-timers-from-the-multiplexer.patch (+68/-41)
v4-0004-oauth-Track-total-call-count-during-a-client-flow.patch (+52/-2)
v4-0005-oauth-Add-unit-tests-for-multiplexer-handling.patch (+600/-1)
#15 Thomas Munro
thomas.munro@gmail.com
In reply to: Jacob Champion (#14)
Re: [PATCH] OAuth: fix performance bug with stuck multiplexer events

On Fri, Aug 8, 2025 at 8:04 AM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:

On Thu, Aug 7, 2025 at 11:11 AM Jacob Champion
<jacob.champion@enterprisedb.com> wrote:

Thank you so much for the reviews!

Here is v4, with the feedback from both of you. 0001-0004 are planned
for backport; 0005 is slated for master only. Thanks again for the
reviews!

LGTM.

#16 Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Thomas Munro (#15)
Re: [PATCH] OAuth: fix performance bug with stuck multiplexer events

On Fri, Aug 8, 2025 at 3:39 AM Thomas Munro <thomas.munro@gmail.com> wrote:

LGTM.

Committed!

Thanks,
--Jacob

#17 Dagfinn Ilmari Mannsåker
ilmari@ilmari.org
In reply to: Jacob Champion (#12)
Re: [PATCH] OAuth: fix performance bug with stuck multiplexer events

Jacob Champion <jacob.champion@enterprisedb.com> writes:

On Thu, Aug 7, 2025 at 9:35 AM Dagfinn Ilmari Mannsåker
<ilmari@ilmari.org> wrote:

I haven't read the meat of the patch, but I have some comments on the
tests:

Thanks for the review!

+IPC::Run::run ['oauth_tests'],
+  '>', IPC::Run::new_chunker, sub { print {$out} $_[0] },
+  '2>', IPC::Run::new_chunker, sub { print {$err} $_[0] }
+  or die "oauth_tests returned $?";

We've recently switched to using fat commas (=>) between options and
their arguments, and that includes the file redirections in IPC::Run.
Although not semantically meaningful, I'd also be tempted to put parens
around the argument list for each redirect, so it's clear that they go
together.

I have two concerns:
- If I don't put parentheses around the list, the fat comma is
actively misleading.
- As far as I can tell, IPC::Run neither documents nor tests the
ability to pass a list here. (But the tests are a bit of a maze, so
please correct me if there is one.) My fear is that I'll be coupling
against an implementation detail if I write it that way.
So I'm leaning towards keeping it as-is, unless you know of a reason
that the list syntax is guaranteed to work, with the understanding
that it does diverge from what you authored in 19c6e92b1. But I don't
think any of those examples use filters, so I don't feel too bad about
the difference yet?

When I said "not semantically meaningful" I meant that the parens don't
change what gets passed to the function. In Perl, parens only serve to
override precedence and as visual grouping, they don't affect the
structure of the data at all.

To demonstrate:

$ perl -MJSON::PP=encode_json -E 'say encode_json([1, 2, 3])'
[1,2,3]

$ perl -MJSON::PP=encode_json -E 'say encode_json([1 => (2, 3)])'
[1,2,3]

Also, indirect object syntax (print {$fh} ...) is ugly and
old-fashioned, it's nicer to call it as a method on the filehandle.

That is much nicer; I'll do that.

As for the C TAP tests, there's already a bunch of TAP-outputting
infrastructure in pg_regress.c. Would it make sense to factor that out
into a common library?

Maybe if we got to rule-of-three, but I'd rather not make either
implementation compromise for the sake of the other. IMO, this is a
situation where a bad abstraction would be much costlier than the
duplication: TAP is lightweight, and I think the needs of a unit test
suite and the needs of a characterization test collector are very
different.

Fair enough.

--Jacob

- ilmari

#18 Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Dagfinn Ilmari Mannsåker (#17)
Re: [PATCH] OAuth: fix performance bug with stuck multiplexer events

On Fri, Aug 8, 2025 at 1:07 PM Dagfinn Ilmari Mannsåker
<ilmari@ilmari.org> wrote:

$ perl -MJSON::PP=encode_json -E 'say encode_json([1, 2, 3])'
[1,2,3]

$ perl -MJSON::PP=encode_json -E 'say encode_json([1 => (2, 3)])'
[1,2,3]

I swear, this language.

But:

$ perl -MJSON::PP=encode_json -E 'say encode_json(1,2)'
Too many arguments for JSON::PP::encode_json at -e line 1, near "2)"
$ perl -MJSON::PP=encode_json -E 'say encode_json((1,2))'
2

So what's going on there? (Google is not very helpful for these sorts
of Perl problems; I don't even know how to describe this.)

I had to revert the test for unrelated reasons [1], so if this is
indeed guaranteed to be safe then I can make the change in my next
attempt.

Thanks!
--Jacob

[1]: /messages/by-id/CAOYmi+nCkoh3zB+GkZad44=FNskwUg6F1kmuxqQZzng7Zgj5tw@mail.gmail.com

#19 Dagfinn Ilmari Mannsåker
ilmari@ilmari.org
In reply to: Jacob Champion (#18)
Re: [PATCH] OAuth: fix performance bug with stuck multiplexer events

Jacob Champion <jacob.champion@enterprisedb.com> writes:

On Fri, Aug 8, 2025 at 1:07 PM Dagfinn Ilmari Mannsåker
<ilmari@ilmari.org> wrote:

$ perl -MJSON::PP=encode_json -E 'say encode_json([1, 2, 3])'
[1,2,3]

$ perl -MJSON::PP=encode_json -E 'say encode_json([1 => (2, 3)])'
[1,2,3]

I swear, this language.

But:

$ perl -MJSON::PP=encode_json -E 'say encode_json(1,2)'
Too many arguments for JSON::PP::encode_json at -e line 1, near "2)"
$ perl -MJSON::PP=encode_json -E 'say encode_json((1,2))'
2

So what's going on there? (Google is not very helpful for these sorts
of Perl problems; I don't even know how to describe this.)

That's because encode_json has a prototype[1], which changes how the
argument list is parsed: no longer just as a flat list of values like a
normal function. Specifically, it has a prototype of '$', which means
it only takes one argument, which is evaluated in scalar context. So
the first example is a syntax error, but in the second example the
parenthesised expression is the single argument. Because it's in scalar
context, the comma is actually the scalar comma operator, not the list
element separator, so the return value is the right-hand side of the
comma (just like in C), not the length of the would-be list.
Confusingly, they are both 2 here (which is why 1, 2, 3, ... are bad values
to use when exploring precedence/context issues in Perl; sorry for doing
that in my example). More clearly (fsvo):

$ perl -MJSON::PP=encode_json -E 'say encode_json((11,12))'
12

- ilmari

[1]: https://perldoc.perl.org/perlsub#Prototypes

#20 Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Dagfinn Ilmari Mannsåker (#19)
Re: [PATCH] OAuth: fix performance bug with stuck multiplexer events

On Fri, Aug 8, 2025 at 2:16 PM Dagfinn Ilmari Mannsåker
<ilmari@ilmari.org> wrote:

That's because encode_json has a prototype[1], which changes how the
argument list is parsed: no longer just as a flat list of values like a
normal function. Specifically, it has a prototype of '$', which means
it only takes one argument, which is evaluated in scalar context. So
the first example is a syntax error, but in the second example the
parenthesised expression is the single argument. Because it's in scalar
context, the comma is actually the scalar comma operator, not the list
element separator, so the return value is the right-hand side of the
comma (just like in C), not the length of the would-be list.

ron-swanson-throws-away-computer.gif

Well, thank you for the explanation. I'll make that change.

--Jacob

#21 Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jacob Champion (#20)
#22 Andrew Dunstan
andrew@dunslane.net
In reply to: Jacob Champion (#21)
#23 Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andrew Dunstan (#22)
#24 Andrew Dunstan
andrew@dunslane.net
In reply to: Jacob Champion (#23)