Automatically sizing the IO worker pool
It's hard to know how to set io_workers=3. If it's too small,
io_method=worker's small submission queue overflows and it silently
falls back to synchronous IO. If it's too high, it generates a lot of
pointless wakeups and scheduling overhead, which might be considered
an independent problem or not, but having the right size pool
certainly mitigates it. Here's a patch to replace that GUC with:
io_min_workers=1
io_max_workers=8
io_worker_idle_timeout=60s
io_worker_launch_interval=500ms
It grows the pool when a backlog is detected (better ideas for this
logic welcome), and lets idle workers time out. IO jobs were already
concentrated into the lowest numbered workers, partly because that
seemed to have marginally better latency than anything else tried so
far due to latch collapsing with lucky timing, and partly in
anticipation of this.
The patch also reduces bogus wakeups somewhat by being more
cautious about fanout. That could probably be improved a lot more and
needs more research. It's quite tricky to figure out how to suppress
wakeups without throwing potential concurrency away.
The first couple of patches are independent of this topic, and might
be potential cleanups/fixes for master/v18. The last is a simple
latency test.
Ideas, testing, flames etc welcome.
Attachments:
0001-aio-Regularize-io_method-worker-naming-conventions.patch (+30 -31)
0002-aio-Remove-IO-worker-ID-references-from-postmaster.c.patch (+14 -15)
0003-aio-Try-repeatedly-to-give-batched-IOs-to-workers.patch (+27 -4)
0004-aio-Adjust-IO-worker-pool-size-automatically.patch (+541 -122)
0005-XXX-read_buffer_loop.patch (+63 -1)
On 12/4/25 18:59, Thomas Munro wrote:
It's hard to know how to set io_workers=3.
Hmmm.... enable the below behaviour if "io_workers=auto" (default)?
Sometimes being able to set this kind of parameter manually helps
tremendously with specific workloads... :S
[snip]
Here's a patch to replace that GUC with:
io_min_workers=1
io_max_workers=8
io_worker_idle_timeout=60s
io_worker_launch_interval=500ms
Great as defaults / backwards compat with io_workers=auto. Sounds more
user-friendly to me, at least....
[snip]
Ideas, testing, flames etc welcome.
Logic seems sound, if a bit daunting for inexperienced users --- well,
maybe just a bit more than it is now, but ISTM evolution should try and
flatten novices' learning curve, right?
Just .02€, though.
Thanks,
--
Parkinson's Law: Work expands to fill the time allotted to it.
On Mon, Apr 14, 2025 at 5:45 AM Jose Luis Tallon
<jltallon@adv-solutions.net> wrote:
On 12/4/25 18:59, Thomas Munro wrote:
It's hard to know how to set io_workers=3.
Hmmm.... enable the below behaviour if "io_workers=auto" (default) ?
Why not just delete io_workers? If you really want a fixed number,
you can set io_min_workers==io_max_workers.
What should io_max_workers default to? I guess it could be pretty
large without much danger, but I'm not sure. If it's a small value,
an overloaded storage system goes through two stages: first it fills
the queue up with a backlog of requests until it overflows because the
configured maximum of workers isn't keeping up, and then new
submissions start falling back to synchronous IO, sort of jumping
ahead of the queued backlog, but also stalling if the real reason is
that the storage itself isn't keeping up. Whether it'd be better for
the IO worker pool to balloon all the way up to 32 processes (an
internal limit) if required to try to avoid that with default
settings, I'm not entirely sure. Maybe? Why not at least try to get
all the concurrency possible, before falling back to synchronous?
Queued but not running IOs seem to be strictly worse than queued but
not even trying to run. I'd be interested to hear people's thoughts
and experiences actually trying different kinds of workloads on
different kinds of storage. Whether adding more concurrency actually
helps or just generates a lot of useless new processes before the
backpressure kicks in depends on why it's not keeping up, eg hitting
IOPS, throughput or concurrency limits in the storage. In later work
I hope we can make higher levels smarter about understanding whether
requesting more concurrency helps or hurts with feedback (that's quite
a hard problem that some of my colleagues have been looking into), but
the simpler question here seems to be: should this fairly low level
system-wide setting ship with a default that includes any preconceived
assumptions about that?
It's superficially like max_parallel_workers, which ships with a
default of 8, and that's basically where I plucked that 8 from in the
current patch for lack of a serious idea to propose yet. But it's
also more complex than CPU: you know how many cores you have and you
know things about your workload, but even really small "on the metal"
systems probably have a lot more concurrent I/O capacity -- perhaps
depending on the type of operation! (and so far we only have reads) --
than CPU cores. Especially once you completely abandon the idea that
anyone runs databases on spinning rust in modern times, even on low
end systems, which I think we've more or less agreed to assume these
days with related changes such as the recent *_io_concurrency default
change (1->16). It's actually pretty hard to drive a laptop up to
needing more than half a dozen or a dozen or so workers with this
patch, especially without debug_io_direct=data, ie with fast
double-buffered I/O, but cloud environments may also be where most
databases run these days, and low-end cloud configurations have
arbitrary made-up limits that may be pretty low, so it all depends....
I really don't know, but one idea is that we could leave it as open as
possible, and let users worry about that with higher-level settings
and the query concurrency they choose to generate...
io_method=io_uring is effectively open, so why should io_method=worker
be any different by default? Just some thoughts. I'm not sure.
On Sun, Apr 13, 2025 at 04:59:54AM GMT, Thomas Munro wrote:
It's hard to know how to set io_workers=3. If it's too small,
io_method=worker's small submission queue overflows and it silently
falls back to synchronous IO. If it's too high, it generates a lot of
pointless wakeups and scheduling overhead, which might be considered
an independent problem or not, but having the right size pool
certainly mitigates it. Here's a patch to replace that GUC with:
io_min_workers=1
io_max_workers=8
io_worker_idle_timeout=60s
io_worker_launch_interval=500ms
It grows the pool when a backlog is detected (better ideas for this
logic welcome), and lets idle workers time out.
I like the idea. In fact, I've been pondering something like a
"smart" configuration for quite some time, and I'm convinced that a
similar approach needs to be applied to many performance-related GUCs.
Idle timeout and launch interval serving as a measure of sensitivity
makes sense to me, but growing the pool when a backlog (queue_depth >
nworkers, so even the slightest backlog?) is detected seems somewhat
arbitrary. From what I understand, the pool's growth velocity is
constant and does not depend on worker demand (i.e. queue_depth)? It
may sound fancy, but I've got the impression it should be possible to
apply what's called a "low-pass filter" in control theory (sort of a
transfer function with an exponential decay) to smooth out the demand
and adjust the worker pool based on that.
As a side note, it might be far-fetched, but there are instruments in
queueing theory to figure out how many workers are needed to guarantee
a certain low queueing probability, but for that one needs to have an
average arrival rate (in our case, the average number of IO operations
dispatched to workers) and an average service rate (the average number
of IO operations performed by workers).
Hi,
On Sun, Apr 13, 2025 at 04:59:54AM GMT, Thomas Munro wrote:
It's hard to know how to set io_workers=3. If it's too small,
io_method=worker's small submission queue overflows and it silently
falls back to synchronous IO. If it's too high, it generates a lot of
pointless wakeups and scheduling overhead, which might be considered
an independent problem or not, but having the right size pool
certainly mitigates it. Here's a patch to replace that GUC with:
io_min_workers=1
io_max_workers=8
io_worker_idle_timeout=60s
io_worker_launch_interval=500ms
It grows the pool when a backlog is detected (better ideas for this
logic welcome), and lets idle workers time out.
I also like the idea. Can we set:
io_workers = 3
io_max_workers = cpu/4
io_workers_oversubscribe = 3 (range 1-8)
io_workers * io_workers_oversubscribe <= io_max_workers
On Sun, May 25, 2025 at 7:20 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
On Sun, Apr 13, 2025 at 04:59:54AM GMT, Thomas Munro wrote:
It's hard to know how to set io_workers=3. If it's too small,
io_method=worker's small submission queue overflows and it silently
falls back to synchronous IO. If it's too high, it generates a lot of
pointless wakeups and scheduling overhead, which might be considered
an independent problem or not, but having the right size pool
certainly mitigates it. Here's a patch to replace that GUC with:
io_min_workers=1
io_max_workers=8
io_worker_idle_timeout=60s
io_worker_launch_interval=500ms
It grows the pool when a backlog is detected (better ideas for this
logic welcome), and lets idle workers time out.
I like the idea. In fact, I've been pondering something like a
"smart" configuration for quite some time, and I'm convinced that a
similar approach needs to be applied to many performance-related GUCs.
Idle timeout and launch interval serving as a measure of sensitivity
makes sense to me, but growing the pool when a backlog (queue_depth >
nworkers, so even the slightest backlog?) is detected seems somewhat
arbitrary. From what I understand, the pool's growth velocity is
constant and does not depend on worker demand (i.e. queue_depth)? It
may sound fancy, but I've got the impression it should be possible to
apply what's called a "low-pass filter" in control theory (sort of a
transfer function with an exponential decay) to smooth out the demand
and adjust the worker pool based on that.
As a side note, it might be far-fetched, but there are instruments in
queueing theory to figure out how many workers are needed to guarantee
a certain low queueing probability, but for that one needs to have an
average arrival rate (in our case, the average number of IO operations
dispatched to workers) and an average service rate (the average number
of IO operations performed by workers).
Hi Dmitry,
Thanks for looking, and yeah these are definitely the right sort of
questions. I will be both unsurprised and delighted if someone can
bring some more science to this problem. I did read up on Erlang's
formula C ("This formula is used to determine the number of agents or
customer service representatives needed to staff a call centre, for a
specified desired probability of queuing" according to Wikipedia) and
a bunch of related textbook stuff. And yeah I had a bunch of
exponential moving averages of various values using scaled fixed point
arithmetic (just a bunch of shifts and adds) to smooth inputs, in
various attempts. But ... I'm not even sure if we can say that our
I/O arrivals have a Poisson distribution, since they are not all
independent. I tried more things too, while I was still unsure what I
should even be optimising for. My current answer to that is: low
latency with low variance, as seen with io_uring.
In this version I went back to basics and built something that looks
more like the controls of a classic process/thread pool (think Apache)
or connection pool (think JDBC), with a couple of additions based on
intuition: (1) a launch interval, which acts as a bit of damping
against overshooting on brief bursts that are too far apart, and (2)
the queue length > workers * k as a simple way to determine that
latency is being introduced by not having enough workers. Perhaps
there is a good way to compute an adaptive value for k with some fancy
theories, but k=1 seems to have *some* basis: that's the lowest number
at which the pool is too small and *certainly* introducing latency, but
any lower constant is harder to defend because we don't know how many
workers are already awake and about to consume tasks. Something from
queuing theory might provide an adaptive value, but in the end, I
figured we really just want to know if the queue is growing ie in
danger of overflowing (note: the queue is small! 64, and not
currently changeable, more on that later, and the overflow behaviour
is synchronous I/O as back-pressure). You seem to be suggesting that
k=1 sounds too low, not too high, but there is that separate
time-based defence against overshoot in response to rare bursts.
You could get more certainty about jobs already about to be consumed
by a worker that is about to dequeue, by doing a lot more
bookkeeping: assigning them to workers on submission (separate states,
separate queues, various other ideas I guess). But everything I tried
like that caused latency or latency variance to go up, because it
missed out on the chance for another worker to pick it up sooner
opportunistically. This arrangement has the most stable and
predictable pool size and lowest avg latency and stddev(latency) I
have managed to come up with so far. That said, we have plenty of
time to experiment with better ideas if you want to give it a shot or
propose concrete ideas, given that I missed v18 :-)
About control theory... yeah. That's an interesting bag of tricks.
FWIW Melanie and (more recently) I have looked into textbook control
algorithms at a higher level of the I/O stack (and Melanie gave a talk
about other applications in eg VACUUM at pgconf.dev). In
read_stream.c, where I/O demand is created, we've been trying to set
the desired I/O concurrency level and thus lookahead distance with
adaptive feedback. We've tried a lot of stuff. I hope we can share
some concept patches some time soon, well, maybe in this cycle. Some
interesting recent experiments produced graphs that look a lot like
the ones in the book "Feedback Control for Computer Systems" (an easy
software-person book I found for people without an engineering/control
theory background where the problems match our world more closely, cf
typical texts that are about controlling motors and other mechanical
stuff...). Experimental goals are: find the smallest concurrent
I/O request level (and thus lookahead distance and thus speculative
work done and buffers pinned) that keeps the I/O stall probability
near zero (and keep adapting, since other queries and applications are
sharing system I/O queues), and if that's not even possible, find the
highest concurrent I/O request level that doesn't cause extra latency
due to queuing in lower levels (I/O workers, kernel, ..., disks).
That second part is quite hard. In other words, if higher levels own
that problem and bring the adaptivity, then perhaps io_method=worker
can get away with being quite stupid. Just a thought...
BTW I would like to push 0001 and 0002 to master/18. They are not
behaviour changes, they just fix up a bunch of inconsistent (0001) and
misleading (0002) variable naming and comments to reflect reality (in
AIO v1 the postmaster used to assign those I/O worker numbers, now the
postmaster has its own array of slots to track them that is *not*
aligned with the ID numbers/slots in shared memory ie those numbers
you see in the ps status, so that's bound to confuse people
maintaining this code). I just happened to notice that when working
on this dynamic worker pool stuff. If there are no objections I will
go ahead and do that soon.
On Mon, May 26, 2025, 8:01 AM Thomas Munro <thomas.munro@gmail.com> wrote:
But ... I'm not even sure if we can say that our
I/O arrivals have a Poisson distribution, since they are not all
independent.
Yeah, a good point, one has to be careful with assumptions about
distribution -- from what I've read, many processes in computer systems
are better described by a Pareto distribution. But the beauty of
queueing theory is that many results are independent of the
distribution (not sure about dependencies though).
In this version I went back to basics and built something that looks
more like the controls of a classic process/thread pool (think Apache)
or connection pool (think JDBC), with a couple of additions based on
intuition: (1) a launch interval, which acts as a bit of damping
against overshooting on brief bursts that are too far apart, and (2)
the queue length > workers * k as a simple way to determine that
latency is being introduced by not having enough workers. Perhaps
there is a good way to compute an adaptive value for k with some fancy
theories, but k=1 seems to have *some* basis: that's the lowest number
at which the pool is too small and *certainly* introducing latency, but
any lower constant is harder to defend because we don't know how many
workers are already awake and about to consume tasks. Something from
queuing theory might provide an adaptive value, but in the end, I
figured we really just want to know if the queue is growing ie in
danger of overflowing (note: the queue is small! 64, and not
currently changeable, more on that later, and the overflow behaviour
is synchronous I/O as back-pressure). You seem to be suggesting that
k=1 sounds too low, not too high, but there is that separate
time-based defence against overshoot in response to rare bursts.
I probably should have started by saying that I find the current
approach reasonable, and I'm only curious whether there is more to get
out of it. I haven't benchmarked the patch yet (I plan to get to it
when I'm back), and can imagine practical considerations significantly
impacting any potential solution.
About control theory... yeah. That's an interesting bag of tricks.
FWIW Melanie and (more recently) I have looked into textbook control
algorithms at a higher level of the I/O stack (and Melanie gave a talk
about other applications in eg VACUUM at pgconf.dev). In
read_stream.c, where I/O demand is created, we've been trying to set
the desired I/O concurrency level and thus lookahead distance with
adaptive feedback. We've tried a lot of stuff. I hope we can share
some concept patches some time soon, well, maybe in this cycle. Some
interesting recent experiments produced graphs that look a lot like
the ones in the book "Feedback Control for Computer Systems" (an easy
software-person book I found for people without an engineering/control
theory background where the problems match our world more closely, cf
typical texts that are about controlling motors and other mechanical
stuff...). Experimental goals are: find the smallest concurrent
I/O request level (and thus lookahead distance and thus speculative
work done and buffers pinned) that keeps the I/O stall probability
near zero (and keep adapting, since other queries and applications are
sharing system I/O queues), and if that's not even possible, find the
highest concurrent I/O request level that doesn't cause extra latency
due to queuing in lower levels (I/O workers, kernel, ..., disks).
That second part is quite hard. In other words, if higher levels own
that problem and bring the adaptivity, then perhaps io_method=worker
can get away with being quite stupid. Just a thought...
Looking forward to it. And thanks for the reminder about the talk; I've
wanted to watch it for a long time, but somehow haven't managed to yet.
On Wed, May 28, 2025 at 5:55 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
I probably should have started by saying that I find the current approach reasonable, and I'm only curious whether there is more to get out of it. I haven't benchmarked the patch yet (I plan to get to it when I'm back), and can imagine practical considerations significantly impacting any potential solution.
Here's a rebase.
Attachments:
v2-0001-aio-Try-repeatedly-to-give-batched-IOs-to-workers.patch (+27 -4)
v2-0002-aio-Adjust-IO-worker-pool-size-automatically.patch (+535 -123)
On Sat, Jul 12, 2025 at 05:08:29PM +1200, Thomas Munro wrote:
On Wed, May 28, 2025 at 5:55 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
I probably should have started by saying that I find the current
approach reasonable, and I'm only curious whether there is more to get
out of it. I haven't benchmarked the patch yet (I plan to get to it
when I'm back), and can imagine practical considerations
significantly impacting any potential solution.
Here's a rebase.
Thanks. I was experimenting with this approach, and realized there
aren't many metrics exposed about workers and the IO queue so far.
Since the worker pool growth is based on the queue size and workers try
to share the load uniformly, it makes sense to have a system view to
show those numbers, let's say a system view for worker handles and a
function to get the current queue size? E.g. worker load in my testing
varied quite a bit, see the "Load distribution between workers" graph,
which shows a quick profiling run including currently running io
workers.
Regarding the worker pool growth approach, it sounds reasonable to me.
With a static number of workers one needs to somehow find a number
suitable for all types of workload, whereas with this patch one only
needs to fiddle with the launch interval to handle possible spikes. It
would be interesting to investigate how this approach would react to
different dynamics of the queue size. I've plotted one "spike" scenario
in "Worker pool size response to queue depth", where there is a pretty
artificial burst of IO, making the queue size look like a step
function. If I understand the patch implementation correctly, it would
respond linearly over time (green line); one could also think about
applying a first-order Butterworth low-pass filter to respond quicker
but still smoothly (orange line).
But in reality the queue size would of course be much more volatile
even on stable workloads, like in "Queue depth over time" (one can see
general oscillation, as well as different modes, e.g. where data is in
the page cache vs where it isn't). Even more, there is a feedback loop
where an increasing number of workers accelerates the queue size
decrease -- based on [1], the system utilization for M/M/k depends on
the arrival rate, processing rate and number of processors, where,
pretty intuitively, more processors reduce utilization. But alas, as
you've mentioned, this result exists for the Poisson distribution only.
Btw, I assume something similar could be done to other methods as well?
I'm not up to date on io uring, can one change the ring depth on the
fly?
As a side note, I was trying to experiment with this patch using
dm-mapper's delay feature to introduce an arbitrary large io latency and
see how the io queue is growing. But strangely enough, even though the
pure io latency was high, the queue growth was smaller than e.g. on a
real hardware under the same conditions without any artificial delay. Is
there anything obvious I'm missing that could have explained that?
[1]: Harchol-Balter, Mor. Performance Modeling and Design of Computer
Systems: Queueing Theory in Action. Cambridge University Press, 2013.
On Wed, Jul 30, 2025 at 10:15 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Thanks. I was experimenting with this approach, and realized there isn't
much metrics exposed about workers and the IO queue so far. Since the
Hmm. You can almost infer the depth from the pg_aios view. All IOs
in use are visible there, and the SUBMITTED ones are all either in the
queue, currently being executed by a worker, or being executed
synchronously by a regular backend because the queue was full and in
that case it just falls back to synchronous execution. Perhaps we
just need to be able to distinguish those three cases in that view.
For the synchronous-in-submitter overflow case, I think f_sync should
really show 't', and I'll post a patch for that shortly. For
"currently executing in a worker", I wonder if we could have an "info"
column that queries a new optional callback
pgaio_iomethod_ops->get_info(ioh) where worker mode could return
"worker 3", or something like that.
worker pool growth is based on the queue size and workers try to share
the load uniformly, it makes to have a system view to show those
Actually it's not uniform: it tries to wake up the lowest numbered
worker that advertises itself as idle, in that little bitmap of idle
workers. So if you look in htop you'll see that worker 0 is the most
busy, then worker 1, etc. Only if they are all quite busy does it
become almost uniform, which probably implies you've hit
io_max_workers and should probably set it higher (or without this
patch, you should probably just increase io_workers manually, assuming
your I/O hardware can take more).
Originally I made it like that to give higher numbered workers a
chance to time out (anticipating this patch). Later I found another
reason to do it that way:
When I tried uniform distribution using atomic_fetch_add(&distributor,
1) % nworkers to select the worker to wake up, avg(latency) and
stddev(latency) were both higher for simple tests like the one
attached to the first message, when running several copies of it
concurrently. The concentrate-into-lowest-numbers design benefits
from latch collapsing and allows the busier workers to avoid going
back to sleep when they could immediately pick up a new job. I didn't
change that in this patch, though I did tweak the "fan out" logic a
bit, after some experimentation on several machines where I realised
the code in master/18 is a bit over enthusiastic about that and has a
higher spurious wakeup ratio (something this patch actually measures
and tries to reduce).
Here is one of my less successful attempts to do a round-robin system
that tries to adjust the pool size with more engineering, but it was
consistently worse on those latency statistics compared to this
approach, and wasn't even as good at finding a good pool size, so
eventually I realised that it was a dead end and my original
work-concentrating concept was better:
https://github.com/macdice/postgres/tree/io-worker-pool
FWIW the patch in this branch is in this public branch:
https://github.com/macdice/postgres/tree/io-worker-pool-3
Regarding the worker pool growth approach, it sounds reasonable to me.
Great to hear. I wonder what other kinds of testing we should do to
validate this, but I am feeling quite confident about this patch and
thinking it should probably go in sooner rather than later.
With static number of workers one needs to somehow find a number
suitable for all types of workload, where with this patch one needs only
to fiddle with the launch interval to handle possible spikes. It would
be interesting to investigate, how this approach would react to
different dynamics of the queue size. I've plotted one "spike" scenario
in the "Worker pool size response to queue depth", where there is a
pretty artificial burst of IO, making the queue size look like a step
function. If I understand the patch implementation correctly, it would
respond linearly over time (green line), one could also think about
applying a first order butterworth low pass filter to respond quicker
but still smooth (orange line).
Interesting.
There is only one kind of smoothing in the patch currently, relating
to the pool size going down. It models spurious latch wakeups in an
exponentially decaying ratio of wakeups:work. That's the only way I
could find to deal with the inherent sloppiness of the wakeup
mechanism with a shared queue: when you wake the lowest numbered idle
worker as of some moment in time, it might lose the race against an
even lower numbered worker that finishes its current job and steals
the new job. When workers steal jobs, latency decreases, which is
good, so instead of preventing it I eventually figured out that we
should measure it, smooth it, and use it to limit wakeup propagation.
I wonder if that naturally produces curves a bit like your butterworth
line when it's going down already, but I'm not sure.
As for the curve on the way up, hmm, I'm not sure. Yes, it goes up
linearly and is limited by the launch delay, but I was thinking of
that only as the way it grows when the *variation* in workload changes
over a long time frame. In other words, maybe it's not so important
how exactly it grows, it's more important that it achieves a steady
state that can handle the oscillations and spikes in your workload.
The idle timeout creates that steady state by holding the current pool
size for quite a while, so that it can handle your quieter and busier
moments immediately without having to adjust the pool size.
In that other failed attempt I tried to model that more explicitly,
with "active" workers and "spare" workers, with the active set sized
for average demand with uniform wakeups and the spare set sized for
some number of standard deviations that are woken up only when the
queue is high, but I could never really make it work well...
But in reality the queue size would of course be much more volatile even
on stable workloads, like in "Queue depth over time" (one can see
general oscillation, as well as different modes, e.g. where data is in
the page cache vs where it isn't). Even more, there is a feedback where
an increasing number of workers would accelerate the queue size decrease --
based on [1] the system utilization for M/M/k depends on the arrival
rate, processing rate and number of processors, where pretty intuitively
more processors reduce utilization. But alas, as you've mentioned this
result exists for Poisson distribution only.
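For reference, the M/M/k utilization alluded to in [1] reduces to a one-liner, arrival rate divided by total service capacity; illustrative only, and subject to the same Poisson caveat:

```c
#include <assert.h>

/*
 * Utilization of an M/M/k queue: rho = lambda / (k * mu), where lambda is
 * the IO arrival rate, mu the per-worker completion rate, and k the pool
 * size.  More workers (larger k) reduce utilization for a fixed workload.
 */
static double
mmk_utilization(double lambda, double mu, int k)
{
    return lambda / (k * mu);
}
```

A value above 1.0 means the queue grows without bound; adding workers pushes it back under 1.0.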
Btw, I assume something similar could be done to other methods as well?
I'm not up to date on io uring, can one change the ring depth on the
fly?
Each backend's io_uring submission queue is configured at startup and
not changeable later, but it is sized for the maximum possible number
that each backend can submit, io_max_concurrency, which corresponds to
the backend's portion of the array of PgAioHandle objects that is
fixed. I suppose you could say that each backend's submission queue
can't overflow at that level, because it's perfectly sized and not
shared with other backends, or to put it another way, the equivalent
of overflow is we won't try to submit more IOs than that.
Worker mode has a shared submission queue, but falls back to
synchronous execution if it's full, which is a bit weird as it makes
your IOs jump the queue in a sense, and that is a good reason to want
this patch so that the pool can try to find the size that avoids that
instead of leaving the user in the dark.
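The overflow behaviour described here can be sketched as follows; the queue size, names and layout are illustrative, not the actual method_worker.c code:

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUE_SIZE 64           /* made-up bound for illustration */

static int queue[QUEUE_SIZE];
static int head;                /* consume position */
static int tail;                /* insert position; head == tail means empty */
static int nsync;               /* IOs that fell back to synchronous execution */

static bool
queue_push(int io)
{
    int next = (tail + 1) % QUEUE_SIZE;

    if (next == head)
        return false;           /* full (one slot sacrificed as sentinel) */
    queue[tail] = io;
    tail = next;
    return true;
}

static void
submit_io(int io)
{
    if (!queue_push(io))
        nsync++;                /* overflow: the submitter runs the IO itself */
}
```

The synchronous fallback is exactly the "queue jumping" mentioned above: the overflowing IO completes before the 63 IOs still sitting in the queue.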
As for the equivalent of pool sizing inside io_uring (and maybe other
AIO systems in other kernels), hmm.... in the absolute best cases
worker threads can be skipped completely, eg for direct I/O queued
straight to the device, but when used, I guess they have pretty
different economics. A kernel can start a thread just by allocating a
bit of memory and sticking it in a queue, and can also wake them (move
them to a different scheduler queue) cheaply, but we have to fork a
giant process that has to open all the files and build up its caches
etc. So I think they just start threads on demand immediately on need
without damping, with some kind of short grace period just to avoid
those smaller costs being repeated. I'm no expert on those internal
details, but our worker system clearly needs all this damping and
steady state discovery heuristics due to the higher overheads and
sloppy wakeups.
Thinking more about our comparatively heavyweight I/O workers, there
must also be affinity opportunities. If you somehow tended to use the
same workers for a given database in a cluster with multiple active
databases, then workers might accumulate fewer open file descriptors
and SMgrRelation cache objects. If you had per-NUMA node pools and
queues then you might be able to reduce contention, and maybe also
cache line ping-pong on buffer headers considering that the submitter
dirties the header, then the worker does (in the completion callback),
and then the submitter accesses it again. I haven't investigated
that.
As a side note, I was trying to experiment with this patch using
dm-mapper's delay feature to introduce an arbitrary large io latency and
see how the io queue is growing. But strangely enough, even though the
pure io latency was high, the queue growth was smaller than e.g. on
real hardware under the same conditions without any artificial delay. Is
there anything obvious I'm missing that could have explained that?
Could it be alternating full and almost empty due to method_worker.c's
fallback to synchronous on overflow, which slows the submission down,
or something like that, and then you're plotting an average depth that
is lower than you expected? With the patch I'll share shortly to make
pg_aios show a useful f_sync value it might be more obvious...
About dm-mapper delays, I actually found it useful to hack up worker
mode itself to simulate storage behaviours, for example swamped local
disks or cloud storage with deep queues and no back pressure but
artificial IOPS and bandwidth caps, etc. I was thinking about
developing some proper settings to help with that kind of research:
debug_io_worker_queue_size (changeable at runtime),
debug_io_max_worker_queue_size (allocated at startup),
debug_io_worker_{latency,bandwidth,iops} to introduce calculated
sleeps, and debug_io_worker_overflow_policy=synchronous|wait so that
you can disable the synchronous fallback that confuses matters.
That'd be more convenient, portable and flexible than dm-mapper tricks
I guess. I'd been imagining that as a tool to investigate higher
level work on feedback control for read_stream.c as mentioned, but
come to think of it, it could also be useful to understand things
about the worker pool itself. That's vapourware though, for myself I
just used dirty hacks last time I was working on that stuff. In other
words, patches are most welcome if you're interested in that kind of
thing. I am a bit tied up with multithreading at the moment and time
grows short. I will come back to that problem in a little while and
that patch is on my list as part of the infrastructure needed to prove
things about the I/O stream feedback work I hope to share later...
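As a sketch of what those proposed (and so far vapourware) debug_io_worker_{latency,iops,bandwidth} settings could compute per IO, one simple policy is to take the largest delay implied by any enabled cap; everything here is a hypothetical illustration:

```c
#include <assert.h>

/*
 * Compute an artificial sleep for one IO.  A cap of 0 means "disabled".
 * latency_us: fixed per-IO latency; iops_cap: max IOs per second;
 * bandwidth_bytes_per_s: max throughput for io_size_bytes-sized IOs.
 */
static long long
simulated_sleep_us(long long latency_us, long long iops_cap,
                   long long bandwidth_bytes_per_s, long long io_size_bytes)
{
    long long sleep_us = latency_us;

    if (iops_cap > 0 && 1000000 / iops_cap > sleep_us)
        sleep_us = 1000000 / iops_cap;  /* minimum spacing between IOs */
    if (bandwidth_bytes_per_s > 0)
    {
        long long transfer_us = 1000000 * io_size_bytes / bandwidth_bytes_per_s;

        if (transfer_us > sleep_us)
            sleep_us = transfer_us;     /* time to move the bytes */
    }
    return sleep_us;
}
```

A real implementation would presumably also need shared token buckets so the caps apply across the whole pool rather than per worker.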
Here is a rebase. I would like to push these shortly if there are no
objections. I propose 8 as the default upper limit.
Attachments:
v3-0001-aio-Simplify-pgaio_worker_submit.patchtext/x-patch; charset=US-ASCII; name=v3-0001-aio-Simplify-pgaio_worker_submit.patchDownload+5-16
v3-0002-aio-Improve-I-O-worker-behavior-on-full-queue.patchtext/x-patch; charset=US-ASCII; name=v3-0002-aio-Improve-I-O-worker-behavior-on-full-queue.patchDownload+21-4
v3-0003-aio-Adjust-I-O-worker-pool-size-automatically.patchtext/x-patch; charset=US-ASCII; name=v3-0003-aio-Adjust-I-O-worker-pool-size-automatically.patchDownload+608-133
On Sat, Mar 28, 2026 at 10:31:36PM +1300, Thomas Munro wrote:
Here is a rebase. I would like to push these shortly if there are no
objections. I propose 8 as the default upper limit.
Yes, please! Just wanted to ask about the status of this patch.
Here's an updated patch. It's mostly just rebased over the recent
firehose, but with lots of comments and a few names (hopefully)
improved. There is one code change to highlight though:
maybe_start_io_workers() knows when it's not allowed to create new
workers, an interesting case being FatalError before we have started
the new world. The previous coding of DetermineSleepTime() didn't
know about that, so it could return 0 (don't sleep), and then the
postmaster could busy-wait for restart progress. Maybe there were
other cases like that, but in general DetermineSleepTime() and
maybe_start_io_workers() really need to be 100% in agreement. So I
have moved that knowledge into a new function
maybe_start_io_workers_scheduled_at(). Both DetermineSleepTime() and
maybe_start_io_workers() call that so there is a single source of
truth.
I think I got confused about that because it's not that obvious why
the existing code doesn't test FatalError.
I thought of a slightly bigger refactoring that might deconfuse
DetermineSleepTime() a bit more. Probably material for the next
cycle, but basically the idea is to stop using a bunch of different
conditions and different units of time and convert the whole thing to
a simple find-the-lowest-time function. I kept that separate.
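The single-source-of-truth pattern can be sketched like this, with invented names standing in for maybe_start_io_workers_scheduled_at() and its two callers:

```c
#include <assert.h>
#include <stdbool.h>

typedef long long TimestampTz;  /* stand-in for the real type */

static TimestampTz next_launch_time;    /* 0 means nothing scheduled */

/* Single source of truth: when is the next worker launch due? */
static TimestampTz
scheduled_at(void)
{
    return next_launch_time;
}

/* The sleep computation consults the schedule... */
static int
determine_sleep_ms(TimestampTz now)
{
    TimestampTz at = scheduled_at();

    if (at == 0)
        return -1;              /* nothing scheduled: sleep indefinitely */
    return at > now ? (int) (at - now) : 0;
}

/* ...and so does the code deciding whether a launch is actually due. */
static bool
launch_due(TimestampTz now)
{
    TimestampTz at = scheduled_at();

    return at != 0 && at <= now;
}
```

Because both callers ask the same function, a sleep of 0 can only be returned when a launch really is due, which is the busy-wait hazard being fixed.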
I'll post a new version of the patch that was v3-0002 separately.
Attachments:
v4-0002-Refactor-the-postmaster-s-periodic-job-scheduling.patchtext/x-patch; charset=US-ASCII; name=v4-0002-Refactor-the-postmaster-s-periodic-job-scheduling.patchDownload+125-123
v4-0001-aio-Adjust-I-O-worker-pool-size-automatically.patchtext/x-patch; charset=US-ASCII; name=v4-0001-aio-Adjust-I-O-worker-pool-size-automatically.patchDownload+659-146
Hi,
On 2026-04-07 03:02:52 +1200, Thomas Munro wrote:
Here's an updated patch. It's mostly just rebased over the recent
firehose, but with lots of comments and a few names (hopefully)
improved. There is one code change to highlight though:
maybe_start_io_workers() knows when it's not allowed to create new
workers, an interesting case being FatalError before we have started
the new world.
*worker, I assume?
The previous coding of DetermineSleepTime() didn't
know about that, so it could return 0 (don't sleep), and then the
postmaster could busy-wait for restart progress.
In master or the prior version of your patch?
Maybe there were
other cases like that, but in general DetermineSleepTime() and
maybe_start_io_workers() really need to be 100% in agreement. So I
have moved that knowledge into a new function
maybe_start_io_workers_scheduled_at(). Both DetermineSleepTime() and
maybe_start_io_workers() call that so there is a single source of
truth.I think I got confused about that because it's not that obvious why
the existing code doesn't test FatalError.I thought of a slightly bigger refactoring that might deconfuse
DetermineSleepTime() a bit more. Probably material for the next
cycle, but basically the idea is to stop using a bunch of different
conditions and different units of time and convert the whole thing to
a simple find-the-lowest-time function. I kept that separate.
I'll post a new version of the patch that was v3-0002 separately.
From 6c5d16a15add62c68bb7f9c7b6a1e3bde1f406d8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 22 Mar 2025 00:36:49 +1300
Subject: [PATCH v4 1/2] aio: Adjust I/O worker pool size automatically.

The size of the I/O worker pool used to implement io_method=worker was
previously controlled by the io_workers setting, defaulting to 3. It
was hard to know how to tune it effectively. It is now replaced with:
io_min_workers=1
io_max_workers=8 (up to 32)
io_worker_idle_timeout=60s
io_worker_launch_interval=100ms
I'm a bit concerned about defaulting to io_min_workers=1. That means in an
intermittent workload, there will be no IO concurrency for short running but
IO intensive queries, while having the dispatch overhead to the worker. It
can still be a win if the query is CPU intensive, but far from all are.
I'd therefore argue that the minimum ought to be at least 2.
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 6f13e8f40a0..c42564500c6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1555,14 +1558,13 @@ checkControlFile(void)
 static int
 DetermineSleepTime(void)
 {
-	TimestampTz next_wakeup = 0;
+	TimestampTz next_wakeup;

 	/*
-	 * Normal case: either there are no background workers at all, or we're in
-	 * a shutdown sequence (during which we ignore bgworkers altogether).
+	 * If an ImmediateShutdown or a crash restart has set a SIGKILL timeout,
+	 * ignore everything else and wait for that.
 	 */
-	if (Shutdown > NoShutdown ||
-		(!StartWorkerNeeded && !HaveCrashedWorker))
+	if (Shutdown >= ImmediateShutdown || FatalError)
 	{
 		if (AbortStartTime != 0)
 		{
@@ -1582,14 +1584,16 @@ DetermineSleepTime(void)

 			return seconds * 1000;
 		}
-		else
-			return 60 * 1000;
 	}

-	if (StartWorkerNeeded)
+	/* Time of next maybe_start_io_workers() call, or 0 for none. */
+	next_wakeup = maybe_start_io_workers_scheduled_at();
+
+	/* Ignore bgworkers during shutdown. */
+	if (StartWorkerNeeded && Shutdown == NoShutdown)
 		return 0;
Why is the maybe_start_io_workers_scheduled_at() thing before the return 0
here?
-	if (HaveCrashedWorker)
+	if (HaveCrashedWorker && Shutdown == NoShutdown)
 	{
 		dlist_mutable_iter iter;
@@ -3797,6 +3811,15 @@ process_pm_pmsignal(void)
StartWorkerNeeded = true;
 	}

+	/* Process IO worker start requests. */
+	if (CheckPostmasterSignal(PMSIGNAL_IO_WORKER_GROW))
+	{
+		/*
+		 * No local flag, as the state is exposed through pgaio_worker_*()
+		 * functions. This signal is received on potentially actionable level
+		 * changes, so that maybe_start_io_workers() will run.
+		 */
+	}
+
 	/* Process background worker state changes. */
 	if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
 	{
Absolute nitpick - the different blocks so far have been separated by an empty
line.
+	/* Only proceed if a "grow" request is pending from existing workers. */
+	if (!pgaio_worker_test_grow())
+		return 0;
So this accesses shared memory from postmaster. I think this amount of access
is safe enough that that's ok. You'd have to somehow have corrupted
postmaster's copy of io_worker_control, or unmapped the shared memory it is
pointed to, for that to cause a crash. The first shouldn't be an issue, the
latter would be quite the confusion for the state machine.
+/*
+ * Start I/O workers if required.  Used at startup, to respond to change of
+ * the io_min_workers GUC, when asked to start a new one due to submission
+ * queue backlog, and after workers terminate in response to errors (by
+ * starting "replacement" workers).
+ */
+static void
+maybe_start_io_workers(void)
+{
+	TimestampTz scheduled_at;

-	/* Not enough running? */
-	while (io_worker_count < io_workers)
+	while ((scheduled_at = maybe_start_io_workers_scheduled_at()) != 0)
 	{
+		TimestampTz now = GetCurrentTimestamp();
 		PMChild    *child;
 		int			i;

+		Assert(pmState < PM_WAIT_IO_WORKERS);
+
+		/* Still waiting for the scheduled time? */
+		if (scheduled_at > now)
+			break;
+
+		/* Clear the grow request flag if it is set. */
+		pgaio_worker_clear_grow();
+
+		/*
+		 * Compute next launch time relative to the previous value, so that
+		 * time spent on the postmaster's other duties don't result in an
+		 * inaccurate launch interval.
+		 */
+		io_worker_launch_next_time =
+			TimestampTzPlusMilliseconds(io_worker_launch_next_time,
+										io_worker_launch_interval);
+
+		/*
+		 * If that's already in the past, the interval is either impossibly
+		 * short or we received no requests for new workers for a period.
+		 * Compute a new future time relative to the last launch time instead.
+		 */
+		if (io_worker_launch_next_time <= now)
+			io_worker_launch_next_time =
+				TimestampTzPlusMilliseconds(io_worker_launch_last_time,
+											io_worker_launch_interval);
Did you intend to use TimestampTzPlusMilliseconds(now, ...) here? Or did you
want to have this if after the next line:
+ io_worker_launch_last_time = now;
+
Because otherwise I don't understand how this is intended to work.
 		/* find unused entry in io_worker_children array */
 		for (i = 0; i < MAX_IO_WORKERS; ++i)
 		{
@@ -4454,20 +4539,14 @@ maybe_adjust_io_workers(void)
 			++io_worker_count;
 		}
 		else
-			break;				/* try again next time */
-	}
-
-	/* Too many running? */
-	if (io_worker_count > io_workers)
-	{
-		/* ask the IO worker in the highest slot to exit */
-		for (int i = MAX_IO_WORKERS - 1; i >= 0; --i)
 		{
-			if (io_worker_children[i] != NULL)
-			{
-				kill(io_worker_children[i]->pid, SIGUSR2);
-				break;
-			}
+			/*
+			 * Fork failure: we'll try again after the launch interval
+			 * expires, or be called again without delay if we don't yet have
+			 * io_min_workers.  Don't loop here though, the postmaster has
+			 * other duties.
+			 */
+			break;
 		}
 	}
 }
Reading just this part of the diff I am wondering what is responsible for
reducing the number of workers below the max after a config change. I assume
it's done in the workers, but it might be worth putting a comment here noting
that.
+/* Debugging support: show current IO and wakeups:ios statistics in ps. */
+/* #define PGAIO_WORKER_SHOW_PS_INFO */

 typedef struct PgAioWorkerSubmissionQueue
{
@@ -63,13 +67,34 @@ typedef struct PgAioWorkerSubmissionQueue

 typedef struct PgAioWorkerSlot
 {
-	Latch	   *latch;
-	bool		in_use;
+	ProcNumber	proc_number;
 } PgAioWorkerSlot;

+/*
+ * Sets of worker IDs are held in a simple bitmap, accessed through functions
+ * that provide a more readable abstraction.  If we wanted to support more
+ * workers than that, the contention on the single queue would surely get too
+ * high, so we might want to consider multiple pools instead of widening this.
+ */
+typedef uint64 PgAioWorkerSet;
+#define PGAIO_WORKER_SET_BITS (sizeof(PgAioWorkerSet) * CHAR_BIT)
+
+static_assert(PGAIO_WORKER_SET_BITS >= MAX_IO_WORKERS, "too small");
+
 typedef struct PgAioWorkerControl
 {
-	uint64		idle_worker_mask;
+	/* Seen by postmaster */
+	volatile bool grow;
What's that volatile intending to do here? It avoids the needs for some
compiler barriers, but it's not clear to me those would be needed here anyway.
And it doesn't imply memory ordering, which I'm not sure is entirely wise
here. I'd probably just plop a full memory barrier in the few relevant
places, easier to reason about that way, and it can't matter given the
infrequency of access. I'd say we should just use a proper atomic, but right
now I don't think we do that in postmaster.
+	/* Protected by AioWorkerSubmissionQueueLock. */
+	PgAioWorkerSet idle_worker_set;
+
+	/* Protected by AioWorkerControlLock. */
+	PgAioWorkerSet worker_set;
+	int			nworkers;
+
+	/* Protected by AioWorkerControlLock. */
 	PgAioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
 } PgAioWorkerControl;

@@ -91,15 +116,103 @@ const IoMethodOps pgaio_worker_ops = {
+static bool
+pgaio_worker_set_is_empty(PgAioWorkerSet *set)
+{
+	return *set == 0;
+}
+
+static PgAioWorkerSet
+pgaio_worker_set_singleton(int worker)
+{
+	return UINT64_C(1) << worker;
+}
I guess an assert about `worker` being small enough wouldn't hurt.
+static void
+pgaio_worker_set_fill(PgAioWorkerSet *set)
+{
+	*set = UINT64_MAX >> (PGAIO_WORKER_SET_BITS - MAX_IO_WORKERS);
+}
What does "_fill" really mean? Just that all valid bits are set? Why wouldn't
it be _all() or _full()?
+static int
+pgaio_worker_set_get_highest(PgAioWorkerSet *set)
+{
+	Assert(!pgaio_worker_set_is_empty(set));
+	return pg_leftmost_one_pos64(*set);
+}
"worker_set_get*" reads quite awkwardly. Maybe just going for
pgaio_workerset_* would help?
Or maybe just name it PgAioWset/pgaio_wset_ or such?
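For readers following along, the set operations under discussion reduce to standard bit tricks; here's a standalone sketch using GCC/Clang builtins in place of pg_bitutils.h, with illustrative names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t WorkerSet;     /* one bit per worker ID */

static bool set_is_empty(WorkerSet s) { return s == 0; }
static void set_insert(WorkerSet *s, int w) { *s |= UINT64_C(1) << w; }
static void set_remove(WorkerSet *s, int w) { *s &= ~(UINT64_C(1) << w); }

/* Highest member: position of the leftmost one bit. */
static int
set_highest(WorkerSet s)
{
    return 63 - __builtin_clzll(s);
}

/* Lowest member, removed from the set as it is returned. */
static int
set_pop_lowest(WorkerSet *s)
{
    int w = __builtin_ctzll(*s);

    *s &= *s - 1;               /* clear the lowest set bit */
    return w;
}
```

All of these are O(1), which is what makes a 64-bit bitmap attractive for a pool capped well below 64 workers.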
+static void
+pgaio_worker_grow(bool grow)
+{
+	/*
+	 * This is called from sites that don't hold AioWorkerControlLock, but
+	 * these values change infrequently and an up-to-date value is not
+	 * required for this heuristic purpose.
+	 */
Is it actually useful to do this while not holding the control lock? Ah, I
see, this is due to the split of submission and control lock.
+	if (!grow)
+	{
+		/* Avoid dirtying memory if not already set. */
+		if (io_worker_control->grow)
+			io_worker_control->grow = false;
Hm. pgaio_worker_grow(grow=false) is a bit odd. And this is basically a copy
of pgaio_worker_cancel_grow() - I realize that's intended for postmaster, but
somehow it's a bit odd.
Maybe just name it pgaio_worker_set_grow()?
+/*
+ * Called by the postmaster to check if a new worker is needed.
+ */
+bool
+pgaio_worker_test_grow(void)
+{
+	return io_worker_control && io_worker_control->grow;
+}
+
+/*
+ * Called by the postmaster to clear the grow flag.
+ */
+void
+pgaio_worker_clear_grow(void)
+{
+	if (io_worker_control)
+		io_worker_control->grow = false;
+}
Maybe we should add _pm_ in there to make it clearer that they're not for
general use?
@@ -226,8 +413,7 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 {
 	PgAioHandle **synchronous_ios = NULL;
 	int			nsync = 0;
-	Latch	   *wakeup = NULL;
-	int			worker;
+	int			worker = -1;

 	Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
@@ -252,19 +438,15 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
break;
 		}

-		if (wakeup == NULL)
-		{
-			/* Choose an idle worker to wake up if we haven't already. */
-			worker = pgaio_worker_choose_idle();
-			if (worker >= 0)
-				wakeup = io_worker_control->workers[worker].latch;
-
-			pgaio_debug_io(DEBUG4, staged_ios[i],
-						   "choosing worker %d",
-						   worker);
-		}
+		/* Choose one worker to wake for this batch. */
+		if (worker == -1)
+			worker = pgaio_worker_choose_idle(0);
 	}
If we only want to do this once per "batch", why not just do it outside the
num_staged_ios loop?
@@ -295,14 +474,27 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 static void
 pgaio_worker_die(int code, Datum arg)
 {
-	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
-	Assert(io_worker_control->workers[MyIoWorkerId].in_use);
-	Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+	PgAioWorkerSet notify_set;

-	io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
-	io_worker_control->workers[MyIoWorkerId].in_use = false;
-	io_worker_control->workers[MyIoWorkerId].latch = NULL;
+	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+	pgaio_worker_set_remove(&io_worker_control->idle_worker_set, MyIoWorkerId);
 	LWLockRelease(AioWorkerSubmissionQueueLock);
+
+	LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
+	Assert(io_worker_control->workers[MyIoWorkerId].proc_number == MyProcNumber);
+	io_worker_control->workers[MyIoWorkerId].proc_number = INVALID_PROC_NUMBER;
+	Assert(pgaio_worker_set_contains(&io_worker_control->worker_set, MyIoWorkerId));
+	pgaio_worker_set_remove(&io_worker_control->worker_set, MyIoWorkerId);
+	notify_set = io_worker_control->worker_set;
+	Assert(io_worker_control->nworkers > 0);
+	io_worker_control->nworkers--;
+	Assert(pgaio_worker_set_count(&io_worker_control->worker_set) ==
+		   io_worker_control->nworkers);
+	LWLockRelease(AioWorkerControlLock);
+
+	/* Notify other workers on pool change. */
Why are we notifying them on pool changes?
+	while (!pgaio_worker_set_is_empty(&notify_set))
+		pgaio_worker_wake(pgaio_worker_set_pop_lowest(&notify_set));
I did already wonder further up if pgaio_worker_wake() should just receive a
worker_set as an argument.
@@ -312,33 +504,34 @@ pgaio_worker_die(int code, Datum arg)
 static void
 pgaio_worker_register(void)
 {
+	PgAioWorkerSet free_worker_set;
+	PgAioWorkerSet old_worker_set;
+
 	MyIoWorkerId = -1;

-	/*
-	 * XXX: This could do with more fine-grained locking. But it's also not
-	 * very common for the number of workers to change at the moment...
-	 */
-	LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+	LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
I guess it could be useful to assert that nworkers is small enough before
doing anything.
+	pgaio_worker_set_fill(&free_worker_set);
+	pgaio_worker_set_subtract(&free_worker_set, &io_worker_control->worker_set);
+	if (!pgaio_worker_set_is_empty(&free_worker_set))
+		MyIoWorkerId = pgaio_worker_set_get_lowest(&free_worker_set);
+	if (MyIoWorkerId == -1)
+		elog(ERROR, "couldn't find a free worker ID");
I'd probably add a comment saying "/* find lowest unused worker ID */" or
such, that was more immediately obvious in the old code.
+/*
+ * Check if this backend is allowed to time out, and thus should use a
+ * non-infinite sleep time.  Only the highest-numbered worker is allowed to
+ * time out, and only if the pool is above io_min_workers.  Serializing
+ * timeouts keeps IDs in a range 0..N without gaps, and avoids undershooting
+ * io_min_workers.
But it's ok if a lower numbered worker errors out, right? There will be a
temporary gap, but we will start a new worker for it? Does that happen even
if there's a shrink of the set of required workers at the same time as a lower
numbered worker errors out?
@@ -439,10 +666,9 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 	while (!ShutdownRequestPending)
 	{
 		uint32		io_index;
-		Latch	   *latches[IO_WORKER_WAKEUP_FANOUT];
-		int			nlatches = 0;
-		int			nwakeups = 0;
-		int			worker;
+		int			worker = -1;
+		int			queue_depth = 0;
+		bool		grow = false;

 		/*
 		 * Try to get a job to do.
@@ -453,38 +679,64 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
 		if ((io_index = pgaio_worker_submission_queue_consume()) == -1)
 		{
-			/*
-			 * Nothing to do. Mark self idle.
-			 *
-			 * XXX: Invent some kind of back pressure to reduce useless
-			 * wakeups?
-			 */
-			io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+			/* Nothing to do. Mark self idle. */
+			pgaio_worker_set_insert(&io_worker_control->idle_worker_set,
+									MyIoWorkerId);
 		}
 		else
 		{
 			/* Got one. Clear idle flag. */
-			io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
+			pgaio_worker_set_remove(&io_worker_control->idle_worker_set,
+									MyIoWorkerId);
Wonder if we should keep track of whether we marked ourselves idle to avoid
needing to do that. But that would be a separate optimization really.
+			/*
+			 * See if we should wake up a higher numbered peer.  Only do that
+			 * if this worker is not receiving spurious wakeups itself.
The "not receiving spurious wakeups" condition is wakeups < ios?
I think both 'wakeups" and "ios" are a bit too generically named. Based on the
names I have no idea what this heuristic might be.
+			 * This heuristic tries to discover the useful wakeup propagation
+			 * chain length when IOs are very fast and workers wake up to find
+			 * that all IOs have already been taken.
+			 *
+			 * If we chose not to wake a worker when we ideally should have,
+			 * the ratio will soon be corrected.
+			 */
+			if (wakeups <= ios)
+			{
+				queue_depth = pgaio_worker_submission_queue_depth();
+				if (queue_depth > 0)
+				{
+					worker = pgaio_worker_choose_idle(MyIoWorkerId + 1);
Is it a problem that we are passing an ID that's potentially bigger than the
biggest legal worker ID? It's probably fine as long as MAX_WORKERS is 32 and
the bitmap is a 64bit integer, but ...
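One way the start index can be made harmless, sketched here (illustrative, not the actual pgaio_worker_choose_idle()): mask off the bits below the start and guard the shift, so an out-of-range start simply yields an empty candidate set:

```c
#include <assert.h>
#include <stdint.h>

/* Lowest idle worker with ID >= start, or -1 if none. */
static int
choose_idle_from(uint64_t idle_set, int start)
{
    uint64_t candidates;

    if (start >= 64)
        return -1;              /* shifting by >= the width is undefined in C */
    candidates = idle_set & (~UINT64_C(0) << start);
    if (candidates == 0)
        return -1;
    return __builtin_ctzll(candidates); /* position of lowest candidate bit */
}
```

The explicit range check matters: without it, `MyIoWorkerId + 1 == 64` would hit undefined shift behaviour rather than just returning "no candidates".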
+					/*
+					 * If there were no idle higher numbered peers and there
+					 * are more than enough IOs queued for me and all lower
+					 * numbered peers, then try to start a new worker.
+					 */
+					if (worker == -1 && queue_depth > MyIoWorkerId)
+						grow = true;
+				}
We probably shouldn't request growth when already at the cap? That could
generate a *lot* of pmsignal traffic, I think?
I don't have an immediate intuitive understanding of why the submission queue
depth is a good measure here.
If there are 10 workers that are busy 100% of the time, and the submission
queue is usually 6 deep with not-being-worked-on IOs, why do we not want to
start more workers?
It actually seems to work - but I don't actually understand why.
ninja install-test-files
io_max_workers=32
debug_io_direct=data
effective_io_concurrency=16
shared_buffers=5GB
pgbench -i -q -s 100 --fillfactor=30
CREATE EXTENSION IF NOT EXISTS test_aio;
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
DROP TABLE IF EXISTS pattern_random_pgbench;
CREATE TABLE pattern_random_pgbench AS SELECT ARRAY(SELECT random(0, pg_relation_size('pgbench_accounts')/8192 - 1)::int4 FROM generate_series(1, pg_relation_size('pgbench_accounts')/8192)) AS pattern;
My test is:
SET effective_io_concurrency = 20;
SELECT pg_buffercache_evict_relation('pgbench_accounts');
SELECT read_stream_for_blocks('pgbench_accounts', pattern) FROM pattern_random_pgbench LIMIT 1;
We end up with ~24-28 workers, even though we never have more than 20 IOs in
flight. Not entirely sure why. I guess it's just that after doing an IO the
worker needs to mark itself idle etc?
if (io_index != -1)
{
 			PgAioHandle *ioh = NULL;

+			/* Cancel timeout and update wakeup:work ratio. */
+			idle_timeout_abs = 0;
+			if (++ios == PGAIO_WORKER_STATS_MAX)
+			{
+				wakeups /= 2;
+				ios /= 2;
+			}
/* Saturation for counters used to estimate wakeup:work ratio. */
#define PGAIO_WORKER_STATS_MAX 4
STATS_MAX sounds like it's just about some reporting or such.
ioh = &pgaio_ctl->io_handles[io_index];
error_ioh = ioh;
errcallback.arg = ioh;
@@ -537,6 +789,14 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
}
 #endif

+#ifdef PGAIO_WORKER_SHOW_PS_INFO
+			sprintf(cmd, "%d: [%s] %s",
+					MyIoWorkerId,
+					pgaio_io_get_op_name(ioh),
+					pgaio_io_get_target_description(ioh));
+			set_ps_display(cmd);
+#endif
Note that this leaks memory. See the target_description comment:
/*
* Return a stringified description of the IO's target.
*
* The string is localized and allocated in the current memory context.
*/
 			/*
 			 * We don't expect this to ever fail with ERROR or FATAL, no need
 			 * to keep error_ioh set to the IO.
@@ -550,8 +810,75 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
 		}
 		else
 		{
-			WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
-					  WAIT_EVENT_IO_WORKER_MAIN);
+			int			timeout_ms;
+
+			/* Cancel new worker request if pending. */
+			pgaio_worker_grow(false);
That seems to happen very frequently.
+			/* Compute the remaining allowed idle time. */
+			if (io_worker_idle_timeout == -1)
+			{
+				/* Never time out. */
+				timeout_ms = -1;
+			}
+			else
+			{
+				TimestampTz now = GetCurrentTimestamp();
+
+				/* If the GUC changes, reset timer. */
+				if (idle_timeout_abs != 0 &&
+					io_worker_idle_timeout != timeout_guc_used)
+					idle_timeout_abs = 0;
+
+				/* On first sleep, compute absolute timeout. */
+				if (idle_timeout_abs == 0)
+				{
+					idle_timeout_abs =
+						TimestampTzPlusMilliseconds(now,
+													io_worker_idle_timeout);
+					timeout_guc_used = io_worker_idle_timeout;
+				}
+
+				/*
+				 * All workers maintain the absolute timeout value, but only
+				 * the highest worker can actually time out and only if
+				 * io_min_workers is satisfied.  All others wait only for
+				 * explicit wakeups caused by queue insertion, wakeup
+				 * propagation, change of pool size (possibly promoting one to
+				 * new highest) or GUC reload.
+				 */
+				if (pgaio_worker_can_timeout())
+					timeout_ms =
+						TimestampDifferenceMilliseconds(now,
+														idle_timeout_abs);
+				else
+					timeout_ms = -1;
Hm. This way you get very rapid worker pool reductions. Configured
io_worker_idle_timeout=1s, started a bunch of work and observed the worker
count after the work finishes:
Mon 06 Apr 2026 02:08:28 PM EDT (every 1s)
count
32
(1 row)
Mon 06 Apr 2026 02:08:29 PM EDT (every 1s)
count
32
(1 row)
Mon 06 Apr 2026 02:08:30 PM EDT (every 1s)
count
1
(1 row)
Mon 06 Apr 2026 02:08:31 PM EDT (every 1s)
count
1
(1 row)
Of course this is a ridiculously low setting, but it does seem like starting
the timeout even when not the highest numbered worker will lead to a lot of
quick yoyoing.
Greetings,
Andres Freund
On Tue, Apr 7, 2026 at 6:14 AM Andres Freund <andres@anarazel.de> wrote:
On 2026-04-07 03:02:52 +1200, Thomas Munro wrote:
Here's an updated patch. It's mostly just rebased over the recent
firehose, but with lots of comments and a few names (hopefully)
improved. There is one code change to highlight though:
maybe_start_io_workers() knows when it's not allowed to create new
workers, an interesting case being FatalError before we have started
the new world.
*worker, I assume?
Thanks for the review and testing!
I meant the new world when "we're already starting up again", as in
this pre-existing code from master:
/*
* Don't start new workers if we're in the shutdown phase of a crash
* restart. But we *do* need to start if we're already starting up again.
*/
if (FatalError && pmState >= PM_STOP_BACKENDS)
return;
The previous coding of DetermineSleepTime() didn't
know about that, so it could return 0 (don't sleep), and then the
postmaster could busy-wait for restart progress.
In master or the prior version of your patch?
master
This code that checks AbortStartTime and overrides the sleep time.
But it wouldn't be entered if FatalError is true but StartWorkerNeeded
or HaveCrashedWorker also happens to be true. Maybe that's OK but I
found it odd.
Maybe there were
other cases like that, but in general DetermineSleepTime() and
maybe_start_io_workers() really need to be 100% in agreement. So I
have moved that knowledge into a new function
maybe_start_io_workers_scheduled_at(). Both DetermineSleepTime() and
maybe_start_io_workers() call that so there is a single source of
truth.
I think I got confused about that because it's not that obvious why
the existing code doesn't test FatalError.
I thought of a slightly bigger refactoring that might deconfuse
DetermineSleepTime() a bit more. Probably material for the next
cycle, but basically the idea is to stop using a bunch of different
conditions and different units of time and convert the whole thing to
a simple find-the-lowest-time function. I kept that separate.
I'll post a new version of the patch that was v3-0002 separately.
From 6c5d16a15add62c68bb7f9c7b6a1e3bde1f406d8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 22 Mar 2025 00:36:49 +1300
Subject: [PATCH v4 1/2] aio: Adjust I/O worker pool size automatically.

The size of the I/O worker pool used to implement io_method=worker was
previously controlled by the io_workers setting, defaulting to 3. It
was hard to know how to tune it effectively. It is now replaced with:
io_min_workers=1
io_max_workers=8 (up to 32)
io_worker_idle_timeout=60s
io_worker_launch_interval=100msI'm a bit concerned about defaulting to io_min_workers=1. That means in an
intermittent workload, there will be no IO concurrency for short running but
IO intensive queries, while having the dispatch overhead to the worker. It
can still be a win if the query is CPU intensive, but far from all are.I'd therefore argue that the minimum ought to be at least 2.
WFM.
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 6f13e8f40a0..c42564500c6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1555,14 +1558,13 @@ checkControlFile(void)
 static int
 DetermineSleepTime(void)
 {
-    TimestampTz next_wakeup = 0;
+    TimestampTz next_wakeup;

     /*
-     * Normal case: either there are no background workers at all, or we're in
-     * a shutdown sequence (during which we ignore bgworkers altogether).
+     * If an ImmediateShutdown or a crash restart has set a SIGKILL timeout,
+     * ignore everything else and wait for that.
      */
-    if (Shutdown > NoShutdown ||
-        (!StartWorkerNeeded && !HaveCrashedWorker))
+    if (Shutdown >= ImmediateShutdown || FatalError)
     {
         if (AbortStartTime != 0)
         {
@@ -1582,14 +1584,16 @@ DetermineSleepTime(void)
             return seconds * 1000;
         }
-        else
-            return 60 * 1000;
     }

-    if (StartWorkerNeeded)
+    /* Time of next maybe_start_io_workers() call, or 0 for none. */
+    next_wakeup = maybe_start_io_workers_scheduled_at();
+
+    /* Ignore bgworkers during shutdown. */
+    if (StartWorkerNeeded && Shutdown == NoShutdown)
         return 0;

Why is the maybe_start_io_workers_scheduled_at() thing before the return 0
here?

Seems OK? I mean sure I would like to make this whole function more
uniform in structure, see my second patch, but...
-    if (HaveCrashedWorker)
+    if (HaveCrashedWorker && Shutdown == NoShutdown)
     {
         dlist_mutable_iter iter;

@@ -3797,6 +3811,15 @@ process_pm_pmsignal(void)
         StartWorkerNeeded = true;
     }

+    /* Process IO worker start requests. */
+    if (CheckPostmasterSignal(PMSIGNAL_IO_WORKER_GROW))
+    {
+        /*
+         * No local flag, as the state is exposed through pgaio_worker_*()
+         * functions. This signal is received on potentially actionable level
+         * changes, so that maybe_start_io_workers() will run.
+         */
+    }
     /* Process background worker state changes. */
     if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
     {

Absolute nitpick - the different blocks so far have been separated by an empty
line.

Fixed.
+    /* Only proceed if a "grow" request is pending from existing workers. */
+    if (!pgaio_worker_test_grow())
+        return 0;

So this accesses shared memory from postmaster. I think this amount of access
is safe enough that that's ok. You'd have to somehow have corrupted
postmaster's copy of io_worker_control, or unmapped the shared memory it is
pointed to, for that to cause a crash. The first shouldn't be an issue, the
latter would be quite the confusion of the state machine.

Cool.
+/*
+ * Start I/O workers if required. Used at startup, to respond to change of
+ * the io_min_workers GUC, when asked to start a new one due to submission
+ * queue backlog, and after workers terminate in response to errors (by
+ * starting "replacement" workers).
+ */
+static void
+maybe_start_io_workers(void)
+{
+    TimestampTz scheduled_at;

-    /* Not enough running? */
-    while (io_worker_count < io_workers)
+    while ((scheduled_at = maybe_start_io_workers_scheduled_at()) != 0)
     {
+        TimestampTz now = GetCurrentTimestamp();
         PMChild    *child;
         int         i;

+        Assert(pmState < PM_WAIT_IO_WORKERS);
+
+        /* Still waiting for the scheduled time? */
+        if (scheduled_at > now)
+            break;
+
+        /* Clear the grow request flag if it is set. */
+        pgaio_worker_clear_grow();
+
+        /*
+         * Compute next launch time relative to the previous value, so that
+         * time spent on the postmaster's other duties don't result in an
+         * inaccurate launch interval.
+         */
+        io_worker_launch_next_time =
+            TimestampTzPlusMilliseconds(io_worker_launch_next_time,
+                                        io_worker_launch_interval);
+
+        /*
+         * If that's already in the past, the interval is either impossibly
+         * short or we received no requests for new workers for a period.
+         * Compute a new future time relative to the last launch time instead.
+         */
+        if (io_worker_launch_next_time <= now)
+            io_worker_launch_next_time =
+                TimestampTzPlusMilliseconds(io_worker_launch_last_time,
+                                            io_worker_launch_interval);

Did you intend to use TimestampTzPlusMilliseconds(now, ...) here? Or did you
want to have this if after the next line:

+        io_worker_launch_last_time = now;
+

Because otherwise I don't understand how this is intended to work.

I can't remember why I did it like that. Changed.
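For what it's worth, the corrected scheduling can be sketched like this (illustrative function and types; plain microsecond integers stand in for TimestampTz): advance from the previous scheduled time to keep a steady cadence, but rebase on the current launch when the schedule has fallen behind.

```c
#include <stdint.h>

typedef int64_t TimestampTz;    /* microseconds */

TimestampTz
advance_launch_time(TimestampTz scheduled, TimestampTz now,
                    TimestampTz interval_us)
{
    TimestampTz next = scheduled + interval_us;

    /*
     * If the schedule has fallen behind (a quiet period, or an impossibly
     * short interval), rebase on the current launch instead, so that we
     * don't "catch up" with a burst of launches.
     */
    if (next <= now)
        next = now + interval_us;
    return next;
}
```

Back-to-back launches keep their cadence (the postmaster's other duties don't stretch the interval), while a long idle gap resets it rather than accumulating launch credit.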
         /* find unused entry in io_worker_children array */
         for (i = 0; i < MAX_IO_WORKERS; ++i)
         {
@@ -4454,20 +4539,14 @@ maybe_adjust_io_workers(void)
             ++io_worker_count;
         }
         else
-            break;              /* try again next time */
-    }
-
-    /* Too many running? */
-    if (io_worker_count > io_workers)
-    {
-        /* ask the IO worker in the highest slot to exit */
-        for (int i = MAX_IO_WORKERS - 1; i >= 0; --i)
         {
-            if (io_worker_children[i] != NULL)
-            {
-                kill(io_worker_children[i]->pid, SIGUSR2);
-                break;
-            }
+            /*
+             * Fork failure: we'll try again after the launch interval
+             * expires, or be called again without delay if we don't yet have
+             * io_min_workers. Don't loop here though, the postmaster has
+             * other duties.
+             */
+            break;
         }
     }
 }

Reading just this part of the diff I am wondering what is responsible for
reducing the number of workers below the max after a config change. I assume
it's done in the workers, but it might be worth putting a comment here noting
that.

Done.
+/* Debugging support: show current IO and wakeups:ios statistics in ps. */
+/* #define PGAIO_WORKER_SHOW_PS_INFO */

 typedef struct PgAioWorkerSubmissionQueue
 {
@@ -63,13 +67,34 @@ typedef struct PgAioWorkerSubmissionQueue

 typedef struct PgAioWorkerSlot
 {
-    Latch      *latch;
-    bool        in_use;
+    ProcNumber  proc_number;
 } PgAioWorkerSlot;

+/*
+ * Sets of worker IDs are held in a simple bitmap, accessed through functions
+ * that provide a more readable abstraction. If we wanted to support more
+ * workers than that, the contention on the single queue would surely get too
+ * high, so we might want to consider multiple pools instead of widening this.
+ */
+typedef uint64 PgAioWorkerSet;

+#define PGAIO_WORKER_SET_BITS (sizeof(PgAioWorkerSet) * CHAR_BIT)
+
+static_assert(PGAIO_WORKER_SET_BITS >= MAX_IO_WORKERS, "too small");
+
 typedef struct PgAioWorkerControl
 {
-    uint64      idle_worker_mask;
+    /* Seen by postmaster */
+    volatile bool grow;

What's that volatile intending to do here? It avoids the need for some
compiler barriers, but it's not clear to me those would be needed here anyway.
And it doesn't imply memory ordering, which I'm not sure is entirely wise
here. I'd probably just plop a full memory barrier in the few relevant
places, easier to reason about that way, and it can't matter given the
infrequency of access. I'd say we should just use a proper atomic, but right
now I don't think we do that in postmaster.

Changed to full memory barrier.
+    /* Protected by AioWorkerSubmissionQueueLock. */
+    PgAioWorkerSet idle_worker_set;
+
+    /* Protected by AioWorkerControlLock. */
+    PgAioWorkerSet worker_set;
+    int         nworkers;
+
+    /* Protected by AioWorkerControlLock. */
     PgAioWorkerSlot workers[FLEXIBLE_ARRAY_MEMBER];
 } PgAioWorkerControl;

@@ -91,15 +116,103 @@ const IoMethodOps pgaio_worker_ops = {

+static bool
+pgaio_worker_set_is_empty(PgAioWorkerSet *set)
+{
+    return *set == 0;
+}
+
+static PgAioWorkerSet
+pgaio_worker_set_singleton(int worker)
+{
+    return UINT64_C(1) << worker;
+}

I guess an assert about `worker` being small enough wouldn't hurt.

Done.
+static void
+pgaio_worker_set_fill(PgAioWorkerSet *set)
+{
+    *set = UINT64_MAX >> (PGAIO_WORKER_SET_BITS - MAX_IO_WORKERS);
+}

What does "_fill" really mean? Just that all valid bits are set? Why wouldn't
it be _all() or _full()?

I guess I got that from sigset_t... Trying pgaio_workerset_all().
+static int
+pgaio_worker_set_get_highest(PgAioWorkerSet *set)
+{
+    Assert(!pgaio_worker_set_is_empty(set));
+    return pg_leftmost_one_pos64(*set);
+}

"worker_set_get*" reads quite awkwardly. Maybe just going for
pgaio_workerset_* would help?

Or maybe just name it PgAioWset/pgaio_wset_ or such?

OK let's try "workerset".
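As a sanity check of the bitmap abstraction being discussed, here's a standalone sketch of the core set operations (illustrative stand-ins only; GCC/Clang builtins take the place of pg_bitutils.h helpers like pg_leftmost_one_pos64):

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative stand-in for the patch's PgAioWorkerSet. */
typedef uint64_t PgAioWorkerSet;

#define MAX_IO_WORKERS 32

PgAioWorkerSet
workerset_all(void)
{
    /* All valid worker IDs set (cf. the _fill/_all discussion above). */
    return UINT64_MAX >> (64 - MAX_IO_WORKERS);
}

int
workerset_lowest(PgAioWorkerSet set)
{
    assert(set != 0);
    return __builtin_ctzll(set);        /* index of lowest set bit */
}

int
workerset_highest(PgAioWorkerSet set)
{
    assert(set != 0);
    return 63 - __builtin_clzll(set);   /* index of highest set bit */
}

int
workerset_pop_lowest(PgAioWorkerSet *set)
{
    int     worker = workerset_lowest(*set);

    *set &= *set - 1;                   /* clear lowest set bit */
    return worker;
}
```

A pop-lowest loop over a snapshot of the set is then all that's needed to visit every member, which is the shape the die-time notification ends up using.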
+static void
+pgaio_worker_grow(bool grow)
+{
+    /*
+     * This is called from sites that don't hold AioWorkerControlLock, but
+     * these values change infrequently and an up-to-date value is not
+     * required for this heuristic purpose.
+     */

Is it actually useful to do this while not holding the control lock? Ah, I
see, this is due to the split of submission and control lock.

Yeah, actually that comment is just confusing. Removed. It's pretty
clear that this flag has the usual sort of postmaster request flag
semantics and tolerates a bit of fuzziness.
+    if (!grow)
+    {
+        /* Avoid dirtying memory if not already set. */
+        if (io_worker_control->grow)
+            io_worker_control->grow = false;

Hm. pgaio_worker_grow(grow=false) is a bit odd. And this is basically a copy
of pgaio_worker_cancel_grow() - I realize that's intended for postmaster, but
somehow it's a bit odd.

Hmm, right.

Maybe just name it pgaio_worker_set_grow()?

OK how about:

pgaio_worker_request_grow()
pgaio_worker_cancel_grow()
+/*
+ * Called by the postmaster to check if a new worker is needed.
+ */
+bool
+pgaio_worker_test_grow(void)
+{
+    return io_worker_control && io_worker_control->grow;
+}
+
+/*
+ * Called by the postmaster to clear the grow flag.
+ */
+void
+pgaio_worker_clear_grow(void)
+{
+    if (io_worker_control)
+        io_worker_control->grow = false;
+}

Maybe we should add _pm_ in there to make it clearer that they're not for
general use?

Done.
@@ -226,8 +413,7 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 {
     PgAioHandle **synchronous_ios = NULL;
     int         nsync = 0;
-    Latch      *wakeup = NULL;
-    int         worker;
+    int         worker = -1;

     Assert(num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);

@@ -252,19 +438,15 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
             break;
         }

-        if (wakeup == NULL)
-        {
-            /* Choose an idle worker to wake up if we haven't already. */
-            worker = pgaio_worker_choose_idle();
-            if (worker >= 0)
-                wakeup = io_worker_control->workers[worker].latch;
-
-            pgaio_debug_io(DEBUG4, staged_ios[i],
-                           "choosing worker %d",
-                           worker);
-        }
+        /* Choose one worker to wake for this batch. */
+        if (worker == -1)
+            worker = pgaio_worker_choose_idle(0);
     }

If we only want to do this once per "batch", why not just do it outside the
num_staged_ios loop?

Two steps: pgaio_worker_choose_idle() must be done while holding the
queue lock (will probably finish up revising this in future work on
removing locks...). pgaio_worker_wake() is called outside the loop,
after releasing the lock.
@@ -295,14 +474,27 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 static void
 pgaio_worker_die(int code, Datum arg)
 {
-    LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
-    Assert(io_worker_control->workers[MyIoWorkerId].in_use);
-    Assert(io_worker_control->workers[MyIoWorkerId].latch == MyLatch);
+    PgAioWorkerSet notify_set;

-    io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
-    io_worker_control->workers[MyIoWorkerId].in_use = false;
-    io_worker_control->workers[MyIoWorkerId].latch = NULL;
+    LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+    pgaio_worker_set_remove(&io_worker_control->idle_worker_set, MyIoWorkerId);
     LWLockRelease(AioWorkerSubmissionQueueLock);
+
+    LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);
+    Assert(io_worker_control->workers[MyIoWorkerId].proc_number == MyProcNumber);
+    io_worker_control->workers[MyIoWorkerId].proc_number = INVALID_PROC_NUMBER;
+    Assert(pgaio_worker_set_contains(&io_worker_control->worker_set, MyIoWorkerId));
+    pgaio_worker_set_remove(&io_worker_control->worker_set, MyIoWorkerId);
+    notify_set = io_worker_control->worker_set;
+    Assert(io_worker_control->nworkers > 0);
+    io_worker_control->nworkers--;
+    Assert(pgaio_worker_set_count(&io_worker_control->worker_set) ==
+           io_worker_control->nworkers);
+    LWLockRelease(AioWorkerControlLock);
+
+    /* Notify other workers on pool change. */

Why are we notifying them on pool changes?

Comments added to explain. It closes a wakeup-loss race (imagine if
you consumed a wakeup while you were exiting due to timeout; no one
else would wake up, which I fixed with this big hammer).
+    while (!pgaio_worker_set_is_empty(&notify_set))
+        pgaio_worker_wake(pgaio_worker_set_pop_lowest(&notify_set));

I did already wonder further up if pgaio_worker_wake() should just receive a
worker_set as an argument.

I have added pgaio_workerset_wake().
@@ -312,33 +504,34 @@ pgaio_worker_die(int code, Datum arg)
 static void
 pgaio_worker_register(void)
 {
+    PgAioWorkerSet free_worker_set;
+    PgAioWorkerSet old_worker_set;
+
     MyIoWorkerId = -1;

-    /*
-     * XXX: This could do with more fine-grained locking. But it's also not
-     * very common for the number of workers to change at the moment...
-     */
-    LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
+    LWLockAcquire(AioWorkerControlLock, LW_EXCLUSIVE);

I guess it could be useful to assert that nworkers is small enough before
doing anything.

OK.
+    pgaio_worker_set_fill(&free_worker_set);
+    pgaio_worker_set_subtract(&free_worker_set, &io_worker_control->worker_set);
+    if (!pgaio_worker_set_is_empty(&free_worker_set))
+        MyIoWorkerId = pgaio_worker_set_get_lowest(&free_worker_set);
+    if (MyIoWorkerId == -1)
+        elog(ERROR, "couldn't find a free worker ID");

I'd probably add a comment saying "/* find lowest unused worker ID */" or
such, that was more immediately obvious in the old code.

Done.
+/*
+ * Check if this backend is allowed to time out, and thus should use a
+ * non-infinite sleep time. Only the highest-numbered worker is allowed to
+ * time out, and only if the pool is above io_min_workers. Serializing
+ * timeouts keeps IDs in a range 0..N without gaps, and avoids undershooting
+ * io_min_workers.

But it's ok if a lower numbered worker errors out, right? There will be a
temporary gap, but we will start a new worker for it?

Yes, it is OK for there to be gaps.
If any worker errors out, it will be replaced when reaped if we fell
below io_min_workers, and otherwise replaced via the usual means, ie
once the backlog detection and the launch delay allow it. I did have
a version that always replaced *every* worker with exit code 1
immediately, but I started wondering if we really want persistent
errors to turn into high speed fork() loops. I'm still not sure TBH.
We don't expect workers to error out, so it means something is already
pretty screwed up and you might appreciate the rate limiting?
I have an always-replace patch somewhere, as I've vacillated on that
point a couple of times. I will post a separate fixup for
consideration.
Does that happen even
if there's a shrink of the set of required workers at the same time as a lower
numbered worker errors out?
If a worker errors out (exit code 1) and an idle worker timed out
(exit code 0), then it's no different: if the new count dropped below
io_min_workers, we start a worker immediately after reaping the process.
Otherwise we let the normal algorithm decide to start a new worker
if/when required.
@@ -439,10 +666,9 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
     while (!ShutdownRequestPending)
     {
         uint32      io_index;
-        Latch      *latches[IO_WORKER_WAKEUP_FANOUT];
-        int         nlatches = 0;
-        int         nwakeups = 0;
-        int         worker;
+        int         worker = -1;
+        int         queue_depth = 0;
+        bool        grow = false;

         /*
          * Try to get a job to do.
@@ -453,38 +679,64 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
         LWLockAcquire(AioWorkerSubmissionQueueLock, LW_EXCLUSIVE);
         if ((io_index = pgaio_worker_submission_queue_consume()) == -1)
         {
-            /*
-             * Nothing to do. Mark self idle.
-             *
-             * XXX: Invent some kind of back pressure to reduce useless
-             * wakeups?
-             */
-            io_worker_control->idle_worker_mask |= (UINT64_C(1) << MyIoWorkerId);
+            /* Nothing to do. Mark self idle. */
+            pgaio_worker_set_insert(&io_worker_control->idle_worker_set,
+                                    MyIoWorkerId);
         }
         else
         {
             /* Got one. Clear idle flag. */
-            io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);
+            pgaio_worker_set_remove(&io_worker_control->idle_worker_set,
+                                    MyIoWorkerId);

Wonder if we should keep track of whether we marked ourselves idle to avoid
needing to do that. But that would be a separate optimization really.

Fair point. OK.
+            /*
+             * See if we should wake up a higher numbered peer. Only do that
+             * if this worker is not receiving spurious wakeups itself.

The "not receiving spurious wakeups" condition is wakeups < ios?

Yes, see new comment near PGAIO_WORKER_WAKEUP_RATIO_SATURATE.

I think both "wakeups" and "ios" are a bit too generically named. Based on the
names I have no idea what this heuristic might be.

I have struggled to name them. Does wakeup_count and io_count help?
+             * This heuristic tries to discover the useful wakeup propagation
+             * chain length when IOs are very fast and workers wake up to find
+             * that all IOs have already been taken.
+             *
+             * If we chose not to wake a worker when we ideally should have,
+             * the ratio will soon be corrected.
+             */
+            if (wakeups <= ios)
             {
+                queue_depth = pgaio_worker_submission_queue_depth();
+                if (queue_depth > 0)
+                {
+                    worker = pgaio_worker_choose_idle(MyIoWorkerId + 1);

Is it a problem that we are passing an ID that's potentially bigger than the
biggest legal worker ID? It's probably fine as long as MAX_WORKERS is 32 and
the bitmap is a 64bit integer, but ...

Oof. Fixed.
+                    /*
+                     * If there were no idle higher numbered peers and there
+                     * are more than enough IOs queued for me and all lower
+                     * numbered peers, then try to start a new worker.
+                     */
+                    if (worker == -1 && queue_depth > MyIoWorkerId)
+                        grow = true;
+                }

We probably shouldn't request growth when already at the cap? That could
generate a *lot* of pmsignal traffic, I think?

No, we only set it if it isn't already set (like a latch), and only
send a pmsignal when we set it (like a latch), and the postmaster only
clears it if it can start a worker (unlike a latch). That applies in
general, not just when we hit the cap of io_max_workers: while the
postmaster is waiting for the launch interval to expire, it will leave the
flag set, suppressed for 100ms or whatever, and in the special
case of io_max_workers, for as long as the count remains that high.
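The latch-like protocol described here can be condensed into a sketch (illustrative names; the shared-memory flag and SendPostmasterSignal() are stubbed with plain globals): workers write and signal only on a false-to-true edge, and the postmaster clears the flag only when it can actually launch.

```c
#include <stdbool.h>

static bool grow_flag;          /* stand-in for io_worker_control->grow */
static int  pmsignals_sent;     /* counts stand-in SendPostmasterSignal() */

void
request_grow(void)
{
    if (!grow_flag)             /* avoid dirtying memory / re-signalling */
    {
        grow_flag = true;
        pmsignals_sent++;       /* edge-triggered signal to postmaster */
    }
}

/* Postmaster side: consume the request only if a launch is possible now. */
bool
pm_consume_grow(bool can_launch_now)
{
    if (grow_flag && can_launch_now)
    {
        grow_flag = false;
        return true;
    }
    return false;               /* flag (if set) stays pending */
}
```

While the launch interval hasn't expired, or the pool is at io_max_workers, the flag simply stays set and no further signals are generated, which is the wakeup-suppression property being claimed above.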
I don't have an immediate intuitive understanding of why the submission queue
depth is a good measure here.

If there are 10 workers that are busy 100% of the time, and the submission
queue is usually 6 deep with not-being-worked-on IOs, why do we not want to
start more workers?

It actually seems to work - but I don't actually understand why.
I should have made it clearer that that's a secondary condition. The
primary condition is: a worker wanted to wake another worker, but
found that none were idle. Unfortunately the whole system is a bit
too asynchronous for that to be a reliable cue on its own. So, I also
check if the queue appears to be (1) obviously growing: that's clearly
too long and must be introducing latency, or (2) varying "too much".
Which I detect in exactly the same way.
Imagine a histogram that looks like this:
LOG: depth 00: 7898
LOG: depth 01: 1630
LOG: depth 02: 308
LOG: depth 03: 93
LOG: depth 04: 40
LOG: depth 05: 19
LOG: depth 06: 6
LOG: depth 07: 4
LOG: depth 08: 0
LOG: depth 09: 1
LOG: depth 10: 1
LOG: depth 11: 0
LOG: depth 12: 0
LOG: depth 13: 0
If you're failing to find idle workers to wake up AND our magic
threshold is hit by something in that long tail, then it'll call for
backup. Of course I'm totally sidestepping a lot of queueing theory
maths and just saying "I'd better be able to find an idle worker when
I want to" and if not, "there had better not be any outliers that
reach this far".
I've written a longer explanation in a long comment. Including a
little challenge for someone to do better with real science and maths.
I hope it's a bit clearer at least.
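To make the two-part condition above concrete, here's a condensed sketch of the grow decision (illustrative function, not the patch's code): only call for backup when (a) we wanted to wake an idle peer and found none, and (b) the queue is deeper than this worker's ID, i.e. deeper than the number of lower-numbered peers that could plausibly drain it.

```c
#include <stdbool.h>

/*
 * my_worker_id: caller's ID; lower-numbered peers are 0..my_worker_id-1.
 * found_idle_peer: pgaio_worker_choose_idle() succeeded (stubbed as a bool).
 * queue_depth: current submission queue depth.
 */
bool
should_request_grow(int my_worker_id, bool found_idle_peer, int queue_depth)
{
    if (found_idle_peer)
        return false;           /* an idle worker can absorb the backlog */

    /* No idle peers: grow only if the tail outruns the lower-numbered pool. */
    return queue_depth > my_worker_id;
}
```

So worker 2 seeing 3 queued IOs with nobody idle requests growth, but worker 4 seeing the same 3 IOs does not, since workers 0..3 should be able to cover them.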
ninja install-test-files

io_max_workers=32
debug_io_direct=data
effective_io_concurrency=16
shared_buffers=5GB

pgbench -i -q -s 100 --fillfactor=30

CREATE EXTENSION IF NOT EXISTS test_aio;
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
DROP TABLE IF EXISTS pattern_random_pgbench;
CREATE TABLE pattern_random_pgbench AS SELECT ARRAY(SELECT random(0, pg_relation_size('pgbench_accounts')/8192 - 1)::int4 FROM generate_series(1, pg_relation_size('pgbench_accounts')/8192)) AS pattern;

My test is:

SET effective_io_concurrency = 20;
SELECT pg_buffercache_evict_relation('pgbench_accounts');
SELECT read_stream_for_blocks('pgbench_accounts', pattern) FROM pattern_random_pgbench LIMIT 1;

We end up with ~24-28 workers, even though we never have more than 20 IOs in
flight. Not entirely sure why. I guess it's just that after doing an IO the
worker needs to mark itself idle etc?
Yep. It would be nice to make it a bit more accurate in later cycles.
It tends to overprovision rather than under, since it thinks all other
workers are busy. That information is a bit racy. In this version
I've made a small improvement: it uses nworkers directly, under the
big new comment, instead of an unnecessarily complicated
approximation.
        if (io_index != -1)
        {
            PgAioHandle *ioh = NULL;

+            /* Cancel timeout and update wakeup:work ratio. */
+            idle_timeout_abs = 0;
+            if (++ios == PGAIO_WORKER_STATS_MAX)
+            {
+                wakeups /= 2;
+                ios /= 2;
+            }

/* Saturation for counters used to estimate wakeup:work ratio. */
#define PGAIO_WORKER_STATS_MAX 4

STATS_MAX sounds like it's just about some reporting or such.

I have renamed it to PGAIO_WORKER_RATIO_MAX and written a big comment
at the top to explain what it's for.
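The halving trick being renamed here is worth spelling out: both counters are halved whenever either reaches the cap, which preserves their ratio while exponentially decaying old history. A standalone sketch (the cap name follows the new naming from this discussion; the counter names are the proposed wakeup_count/io_count):

```c
#define PGAIO_WORKER_RATIO_MAX 4    /* saturation cap from the discussion */

static int wakeup_count;            /* wakeups received */
static int io_count;                /* IOs actually found and executed */

static void
decay_if_saturated(void)
{
    /* Halve both counters together: keeps the ratio, forgets old history. */
    if (wakeup_count == PGAIO_WORKER_RATIO_MAX ||
        io_count == PGAIO_WORKER_RATIO_MAX)
    {
        wakeup_count /= 2;
        io_count /= 2;
    }
}

void
count_io(void)
{
    ++io_count;
    decay_if_saturated();
}

void
count_wakeup(void)
{
    ++wakeup_count;
    decay_if_saturated();
}
```

The `wakeups <= ios` test in the worker loop then enables wakeup propagation only while wakeups have mostly been paying off; a burst of spurious wakeups quickly flips the ratio and suppresses fanout.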
ioh = &pgaio_ctl->io_handles[io_index];
error_ioh = ioh;
errcallback.arg = ioh;
@@ -537,6 +789,14 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
}
#endif

+#ifdef PGAIO_WORKER_SHOW_PS_INFO
+            sprintf(cmd, "%d: [%s] %s",
+                    MyIoWorkerId,
+                    pgaio_io_get_op_name(ioh),
+                    pgaio_io_get_target_description(ioh));
+            set_ps_display(cmd);
+#endif

Note that this leaks memory. See the target_description comment:
/*
* Return a stringified description of the IO's target.
*
* The string is localized and allocated in the current memory context.
*/
Fixed.
            /*
             * We don't expect this to ever fail with ERROR or FATAL, no need
             * to keep error_ioh set to the IO.
@@ -550,8 +810,75 @@ IoWorkerMain(const void *startup_data, size_t startup_data_len)
         }
         else
         {
-            WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
-                      WAIT_EVENT_IO_WORKER_MAIN);
+            int         timeout_ms;
+
+            /* Cancel new worker request if pending. */
+            pgaio_worker_grow(false);

That seems to happen very frequently.

Yeah, but it doesn't write to memory after someone else does it. This
again is part of the strategy for preventing excess workers from being
created, if I've found the queue to be empty.
+            /*
+             * All workers maintain the absolute timeout value, but only
+             * the highest worker can actually time out and only if
+             * io_min_workers is satisfied. All others wait only for
+             * explicit wakeups caused by queue insertion, wakeup
+             * propagation, change of pool size (possibly promoting one to
+             * new highest) or GUC reload.
+             */
+            if (pgaio_worker_can_timeout())
+                timeout_ms =
+                    TimestampDifferenceMilliseconds(now,
+                                                    idle_timeout_abs);
+            else
+                timeout_ms = -1;

Hm. This way you get very rapid worker pool reductions. Configured
io_worker_idle_timeout=1s, started a bunch of work and observed the worker
count after the work finishes:

Mon 06 Apr 2026 02:08:28 PM EDT (every 1s)

 count
    32
(1 row)

Mon 06 Apr 2026 02:08:29 PM EDT (every 1s)

 count
    32
(1 row)

Mon 06 Apr 2026 02:08:30 PM EDT (every 1s)

 count
     1
(1 row)

Mon 06 Apr 2026 02:08:31 PM EDT (every 1s)

 count
     1
(1 row)

Of course this is a ridiculously low setting, but it does seem like starting
the timeout even when not the highest numbered worker will lead to a lot of
quick yoyoing.
I have changed it so that after one worker times out, the next one
begins its timeout count from 0. (This is one of the reasons for that
"notify the whole pool when I exit" thing.)
Attachments:
v5-0001-aio-Adjust-I-O-worker-pool-size-automatically.patchtext/x-patch; charset=US-ASCII; name=v5-0001-aio-Adjust-I-O-worker-pool-size-automatically.patchDownload+751-146
Hi,
On 2026-04-07 22:39:37 +1200, Thomas Munro wrote:
@@ -1582,14 +1584,16 @@ DetermineSleepTime(void)
             return seconds * 1000;
         }
-        else
-            return 60 * 1000;
     }

-    if (StartWorkerNeeded)
+    /* Time of next maybe_start_io_workers() call, or 0 for none. */
+    next_wakeup = maybe_start_io_workers_scheduled_at();
+
+    /* Ignore bgworkers during shutdown. */
+    if (StartWorkerNeeded && Shutdown == NoShutdown)
         return 0;

Why is the maybe_start_io_workers_scheduled_at() thing before the return 0
here?

Seems OK? I mean sure I would like to make this whole function more
uniform in structure, see my second patch, but...

It's ok, there just doesn't seem to be a point in doing it before that if,
rather than just after...
+static int
+pgaio_worker_set_get_highest(PgAioWorkerSet *set)
+{
+    Assert(!pgaio_worker_set_is_empty(set));
+    return pg_leftmost_one_pos64(*set);
+}

"worker_set_get*" reads quite awkwardly. Maybe just going for
pgaio_workerset_* would help?

Or maybe just name it PgAioWset/pgaio_wset_ or such?

OK let's try "workerset".
Looks better.
Maybe just name it pgaio_worker_set_grow()?
OK how about:
pgaio_worker_request_grow()
pgaio_worker_cancel_grow()
WFM.
@@ -252,19 +438,15 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
             break;
         }

-        if (wakeup == NULL)
-        {
-            /* Choose an idle worker to wake up if we haven't already. */
-            worker = pgaio_worker_choose_idle();
-            if (worker >= 0)
-                wakeup = io_worker_control->workers[worker].latch;
-
-            pgaio_debug_io(DEBUG4, staged_ios[i],
-                           "choosing worker %d",
-                           worker);
-        }
+        /* Choose one worker to wake for this batch. */
+        if (worker == -1)
+            worker = pgaio_worker_choose_idle(0);
     }

If we only want to do this once per "batch", why not just do it outside the
num_staged_ios loop?

Two steps: pgaio_worker_choose_idle() must be done while holding the
queue lock (will probably finish up revising this in future work on
removing locks...). pgaio_worker_wake() is called outside the loop,
after releasing the lock.
I just meant doing it outside the for loop.
for (int i = 0; i < num_staged_ios; ++i)
{
Assert(!pgaio_worker_needs_synchronous_execution(staged_ios[i]));
if (!pgaio_worker_submission_queue_insert(staged_ios[i]))
{
/*
* Do the rest synchronously. If the queue is full, give up
* and do the rest synchronously. We're holding an exclusive
* lock on the queue so nothing can consume entries.
*/
synchronous_ios = &staged_ios[i];
nsync = (num_staged_ios - i);
break;
}
/* Choose one worker to wake for this batch. */
if (worker == -1)
worker = pgaio_worker_choose_idle(-1);
}
The if (worker == -1) is done for every to be submitted IO. If there are no
idle workers, we'd redo the pgaio_worker_choose_idle() every time. ISTM it
should just be:
for (int i = 0; i < num_staged_ios; ++i)
{
Assert(!pgaio_worker_needs_synchronous_execution(staged_ios[i]));
if (!pgaio_worker_submission_queue_insert(staged_ios[i]))
{
/*
* Do the rest synchronously. If the queue is full, give up
* and do the rest synchronously. We're holding an exclusive
* lock on the queue so nothing can consume entries.
*/
synchronous_ios = &staged_ios[i];
nsync = (num_staged_ios - i);
break;
}
}
/* Choose one worker to wake for this batch. */
if (worker == -1)
worker = pgaio_worker_choose_idle(-1);
@@ -295,14 +474,27 @@ pgaio_worker_submit(uint16 num_staged_ios, PgAioHandle **staged_ios)
 static void
 pgaio_worker_die(int code, Datum arg)
 {
[...]
+    /* Notify other workers on pool change. */

Why are we notifying them on pool changes?

Comments added to explain. It closes a wakeup-loss race (imagine if
you consumed a wakeup while you were exiting due to timeout; no one
else would wake up, which I fixed with this big hammer).

Thanks, looks a lot clearer now.
+/*
+ * Check if this backend is allowed to time out, and thus should use a
+ * non-infinite sleep time. Only the highest-numbered worker is allowed to
+ * time out, and only if the pool is above io_min_workers. Serializing
+ * timeouts keeps IDs in a range 0..N without gaps, and avoids undershooting
+ * io_min_workers.

But it's ok if a lower numbered worker errors out, right? There will be a
temporary gap, but we will start a new worker for it?

Yes, it is OK for there to be gaps.
If any worker errors out, it will be replaced when reaped if we fell
below io_min_workers, and otherwise replaced via the usual means, ie
once the backlog detection and the launch delay allow it. I did have
a version that always replaced *every* worker with exit code 1
immediately, but I started wondering if we really want persistent
errors to turn into high speed fork() loops. I'm still not sure TBH.
We don't expect workers to error out, so it means something is already
pretty screwed up and you might appreciate the rate limiting?
Yea, I think it's saner not to do that.
I think both "wakeups" and "ios" are a bit too generically named. Based on the
names I have no idea what this heuristic might be.

I have struggled to name them. Does wakeup_count and io_count help?

hist_wakeups, hist_ios?
+                    /*
+                     * If there were no idle higher numbered peers and there
+                     * are more than enough IOs queued for me and all lower
+                     * numbered peers, then try to start a new worker.
+                     */
+                    if (worker == -1 && queue_depth > MyIoWorkerId)
+                        grow = true;
+                }

We probably shouldn't request growth when already at the cap? That could
generate a *lot* of pmsignal traffic, I think?

No, we only set it if it isn't already set (like a latch), and only
send a pmsignal when we set it (like a latch), and the postmaster only
clears it if it can start a worker (unlike a latch). That applies in
general, not just when we hit the cap of io_max_workers: while the
postmaster is waiting for the launch interval to expire, it will leave the
flag set, suppressed for 100ms or whatever, and in the special
case of io_max_workers, for as long as the count remains that high.
I'm quite certain that's not how it actually ended up working with the prior
version and the benchmark I showed, there indeed were a lot of requests to
postmaster. I think it's because pgaio_worker_cancel_grow() (forgot the old
name already) very frequently clears the flag, just for it to be immediately
set again.
Yep, still happens, does require the max to be smaller than 32 though.
While a lot of IO is happening, no new connections being started, and with
1781562 being postmaster's pid:
perf stat --no-inherit -p 1781562 -e raw_syscalls:sys_enter -r 0 sleep 1
Performance counter stats for process id '1781562':
2,790 raw_syscalls:sys_enter
1.001872667 seconds time elapsed
2,814 raw_syscalls:sys_enter
1.001983049 seconds time elapsed
3,036 raw_syscalls:sys_enter
1.001705850 seconds time elapsed
2,982 raw_syscalls:sys_enter
1.001881364 seconds time elapsed
I think it may need a timestamp in the shared state to not allow another
postmaster wake until some time has elapsed, or something.
I should have made it clearer that that's a secondary condition. The
primary condition is: a worker wanted to wake another worker, but
found that none were idle. Unfortunately the whole system is a bit
too asynchronous for that to be a reliable cue on its own. So, I also
check if the queue appears to be (1) obviously growing: that's clearly
too long and must be introducing latency, or (2) varying "too much".
Which I detect in exactly the same way.

Imagine a histogram that looks like this:
LOG: depth 00: 7898
LOG: depth 01: 1630
LOG: depth 02: 308
LOG: depth 03: 93
LOG: depth 04: 40
LOG: depth 05: 19
LOG: depth 06: 6
LOG: depth 07: 4
LOG: depth 08: 0
LOG: depth 09: 1
LOG: depth 10: 1
LOG: depth 11: 0
LOG: depth 12: 0
LOG: depth 13: 0

If you're failing to find idle workers to wake up AND our magic
threshold is hit by something in that long tail, then it'll call for
backup. Of course I'm totally sidestepping a lot of queueing theory
maths and just saying "I'd better be able to find an idle worker when
I want to" and if not, "there had better not be any outliers that
reach this far".

I've written a longer explanation in a long comment. Including a
little challenge for someone to do better with real science and maths.
I hope it's a bit clearer at least.
Definitely good to have that comment. Have to ponder it for a bit.
ninja install-test-files
io_max_workers=32
debug_io_direct=data
effective_io_concurrency=16
shared_buffers=5GB

pgbench -i -q -s 100 --fillfactor=30
CREATE EXTENSION IF NOT EXISTS test_aio;
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
DROP TABLE IF EXISTS pattern_random_pgbench;
CREATE TABLE pattern_random_pgbench AS SELECT ARRAY(SELECT random(0, pg_relation_size('pgbench_accounts')/8192 - 1)::int4 FROM generate_series(1, pg_relation_size('pgbench_accounts')/8192)) AS pattern;

My test is:
SET effective_io_concurrency = 20;
SELECT pg_buffercache_evict_relation('pgbench_accounts');
SELECT read_stream_for_blocks('pgbench_accounts', pattern) FROM pattern_random_pgbench LIMIT 1;

We end up with ~24-28 workers, even though we never have more than 20 IOs in
flight. Not entirely sure why. I guess it's just that after doing an IO the
worker needs to mark itself idle etc?

Yep. It would be nice to make it a bit more accurate in later cycles.
It tends to overprovision rather than under, since it thinks all other
workers are busy.
I think that's the right direction to err into.
That information is a bit racy.
Yea, I think that's fine.
Hm. This way you get very rapid worker pool reductions. Configured
io_worker_idle_timeout=1s, started a bunch of work and observed the worker
count after the work finishes:
...
Of course this is a ridiculously low setting, but it does seem like starting
the timeout even when not the highest numbered worker will lead to a lot of
quick yoyoing.

I have changed it so that after one worker times out, the next one
begins its timeout count from 0. (This is one of the reasons for that
"notify the whole pool when I exit" thing.)
That looks much better in a quick test.
I've not again looked through the details, but based on a relatively short
experiment, the one problematic thing I see is the frequent postmaster
requests.
Greetings,
Andres Freund
On Wed, Apr 8, 2026 at 7:01 AM Andres Freund <andres@anarazel.de> wrote:
The if (worker == -1) is done for every to-be-submitted IO. If there are no
idle workers, we'd redo the pgaio_worker_choose_idle() every time. ISTM it
should just be:

for (int i = 0; i < num_staged_ios; ++i)
{
Assert(!pgaio_worker_needs_synchronous_execution(staged_ios[i]));
if (!pgaio_worker_submission_queue_insert(staged_ios[i]))
{
/*
 * The queue is full, so give up and do the rest
 * synchronously. We're holding an exclusive lock on the
 * queue so nothing can consume entries.
 */
synchronous_ios = &staged_ios[i];
nsync = (num_staged_ios - i);
break;
}
}

/* Choose one worker to wake for this batch. */
if (worker == -1)
worker = pgaio_worker_choose_idle(-1);
Well I didn't want to wake a worker if we'd failed to enqueue
anything. Ahh, I could put it there and test nsync. Or I guess I
could just do it anyway. Considering that.
I think both "wakeups" and "ios" are a bit too generically named. Based on the
names I have no idea what this heuristic might be.

I have struggled to name them. Does wakeup_count and io_count help?
hist_wakeups, hist_ios?
Thanks, that's a good name.
No, we only set it if it isn't already set (like a latch), and only
send a pmsignal when we set it (like a latch), and the postmaster only
clears it if it can start a worker (unlike a latch). That applies in
general, not just when we hit the cap of io_max_workers: while the
postmaster is waiting for the launch interval to expire, it will leave the
flag set, suppressed for 100ms or whatever, and then in the special
case of io_max_workers, for as long as the count remains that high.

I'm quite certain that's not how it actually ended up working with the prior
version and the benchmark I showed, there indeed were a lot of requests to
postmaster. I think it's because pgaio_worker_cancel_grow() (forgot the old
name already) very frequently clears the flag, just for it to be immediately
set again.

Yep, still happens, does require the max to be smaller than 32 though.
While a lot of IO is happening, no new connections being started, and with
1781562 being postmaster's pid:

perf stat --no-inherit -p 1781562 -e raw_syscalls:sys_enter -r 0 sleep 1
Performance counter stats for process id '1781562':
2,790 raw_syscalls:sys_enter
1.001872667 seconds time elapsed
2,814 raw_syscalls:sys_enter
1.001983049 seconds time elapsed
3,036 raw_syscalls:sys_enter
1.001705850 seconds time elapsed
2,982 raw_syscalls:sys_enter
1.001881364 seconds time elapsed
I think it may need a timestamp in the shared state to not allow another
postmaster wake until some time has elapsed, or something.
Hnng. Studying...
I should have made it clearer that that's a secondary condition. The
primary condition is: a worker wanted to wake another worker, but
found that none were idle. Unfortunately the whole system is a bit
too asynchronous for that to be a reliable cue on its own. So, I also
check if the queue appears to be (1) obviously growing: that's clearly
too long and must be introducing latency, or (2) varying "too much".
Which I detect in exactly the same way.

Imagine a histogram that looks like this:
LOG: depth 00: 7898
LOG: depth 01: 1630
LOG: depth 02: 308
LOG: depth 03: 93
LOG: depth 04: 40
LOG: depth 05: 19
LOG: depth 06: 6
LOG: depth 07: 4
LOG: depth 08: 0
LOG: depth 09: 1
LOG: depth 10: 1
LOG: depth 11: 0
LOG: depth 12: 0
LOG: depth 13: 0

If you're failing to find idle workers to wake up AND our magic
threshold is hit by something in that long tail, then it'll call for
backup. Of course I'm totally sidestepping a lot of queueing theory
maths and just saying "I'd better be able to find an idle worker when
I want to" and if not, "there had better not be any outliers that
reach this far".

I've written a longer explanation in a long comment. Including a
little challenge for someone to do better with real science and maths.
I hope it's a bit clearer at least.

Definitely good to have that comment. Have to ponder it for a bit.
Let me try again.
Our goal is simple: process every IO immediately. We have immediate
feedback that is simple: there's an IO in the queue and there is no
idle worker. The only action we can take is simple: add one more
worker. So we don't need to suffer through the maths required to
figure out the ideal k for our M/G/k queue system (I think that's what
we have?) or any of the inputs it would require*. The problem is
that on its own, the test triggered far too easily because a worker
that is not marked idle might in fact be just about to pick up that IO
on the one hand, and because there might be rare
spikes/clustering on the other, so I cooled it off a bit by
additionally testing if the queue appears to be growing or spiking
beyond some threshold. I think it's OK to let the queue grow a bit
before we are triggered anyway, so the precise value used doesn't seem
too critical. Someone might be able to come up with a more defensible
value, but in the end I just wanted a value that isn't triggered by
the outliers I see in real systems that are keeping up. We could tune
it lower and overshoot more, but this setting seems to work pretty
well. It doesn't seem likely that a real system could achieve a
steady state that is introducing latency but isn't increasing over
time, and pool size adjustments are bound to lag anyway.
* It's probably quite hard for call centres to figure out the number
of agents required to make you wait for a certain length of time, but
it's easy to know if you had to wait and you wish they had more!
I've not again looked through the details, but based on a relatively short
experiment, the one problematic thing I see is the frequent postmaster
requests.
Looking into that...
Hi,
On 2026-04-08 11:18:51 +1200, Thomas Munro wrote:
On Wed, Apr 8, 2026 at 7:01 AM Andres Freund <andres@anarazel.de> wrote:
The if (worker == -1) is done for every to-be-submitted IO. If there are no
idle workers, we'd redo the pgaio_worker_choose_idle() every time. ISTM it
should just be:

for (int i = 0; i < num_staged_ios; ++i)
{
Assert(!pgaio_worker_needs_synchronous_execution(staged_ios[i]));
if (!pgaio_worker_submission_queue_insert(staged_ios[i]))
{
/*
 * The queue is full, so give up and do the rest
 * synchronously. We're holding an exclusive lock on the
 * queue so nothing can consume entries.
 */
synchronous_ios = &staged_ios[i];
nsync = (num_staged_ios - i);
break;
}
}

/* Choose one worker to wake for this batch. */
if (worker == -1)
worker = pgaio_worker_choose_idle(-1);

Well I didn't want to wake a worker if we'd failed to enqueue
anything.
I think it's worth waking up workers if there are idle ones and the queue is
full?
No, we only set it if it isn't already set (like a latch), and only
send a pmsignal when we set it (like a latch), and the postmaster only
clears it if it can start a worker (unlike a latch). That applies in
general, not just when we hit the cap of io_max_workers: while the
postmaster is waiting for the launch interval to expire, it will leave the
flag set, suppressed for 100ms or whatever, and then in the special
case of io_max_workers, for as long as the count remains that high.

I'm quite certain that's not how it actually ended up working with the prior
version and the benchmark I showed, there indeed were a lot of requests to
postmaster. I think it's because pgaio_worker_cancel_grow() (forgot the old
name already) very frequently clears the flag, just for it to be immediately
set again.

Yep, still happens, does require the max to be smaller than 32 though.
While a lot of IO is happening, no new connections being started, and with
1781562 being postmaster's pid:

perf stat --no-inherit -p 1781562 -e raw_syscalls:sys_enter -r 0 sleep 1
2,982 raw_syscalls:sys_enter
1.001881364 seconds time elapsed
I think it may need a timestamp in the shared state to not allow another
postmaster wake until some time has elapsed, or something.

Hnng. Studying...
I suspect the primary reason is that pgaio_worker_request_grow() is triggered
even when io_worker_control->nworkers is >= io_max_workers.
I suspect there's also pingpong between submission not finding any workers
idle, requesting growth, and workers being idle for a short period, then the
same thing starting again.
Seems like there should be two fields. One saying "notify postmaster again"
and one "postmaster start a worker". The former would only be cleared by
postmaster after the timeout.
Our goal is simple: process every IO immediately. We have immediate
feedback that is simple: there's an IO in the queue and there is no
idle worker. The only action we can take is simple: add one more
worker. So we don't need to suffer through the maths required to
figure out the ideal k for our M/G/k queue system (I think that's what
we have?) or any of the inputs it would require*. The problem is
that on its own, the test triggered far too easily because a worker
that is not marked idle might in fact be just about to pick up that IO
Is that case really concerning? As long as you have some rate limiting about
the start rate, starting another worker when there are no idle workers seems
harmless? Afaict it's fairly self limiting.
on the one hand, and because there might be rare
spikes/clustering on the other, so I cooled it off a bit by
additionally testing if the queue appears to be growing or spiking
beyond some threshold. I think it's OK to let the queue grow a bit
before we are triggered anyway, so the precise value used doesn't seem
too critical. Someone might be able to come up with a more defensible
value, but in the end I just wanted a value that isn't triggered by
the outliers I see in real systems that are keeping up. We could tune
it lower and overshoot more, but this setting seems to work pretty
well. It doesn't seem likely that a real system could achieve a
steady state that is introducing latency but isn't increasing over
time, and pool size adjustments are bound to lag anyway.
Yea, I don't think the precise logic matters that much as long as we ramp up
reasonably fast without being crazy and ramp down a bit slower.
Greetings,
Andres Freund
I changed pgaio_worker_request_grow() not to bother the postmaster
unless nworkers < io_max_workers.
I move that code you wanted outside the loop and did:
/* Choose one worker to wake for this batch. */
if (nsync < num_staged_ios)
worker = pgaio_worker_choose_idle(-1);
I took your suggestion for the names hist_wakeups and hist_ios.
For the location of the following line, I preferred not to separate
the pre-existing tests of StartWorkerNeeded and HaveCrashedWorker,
since they belong together as bgworker concerns.
next_wakeup = maybe_start_io_workers_scheduled_at();
I think I've run out of reasons not to commit this, unless your
pondering of the grow-trigger heuristics revealed a problem?