pgsql: Increase default maintenance_io_concurrency to 16
Increase default maintenance_io_concurrency to 16
Since its introduction in fc34b0d9de27a, the default
maintenance_io_concurrency has been larger than the default
effective_io_concurrency. maintenance_io_concurrency primarily
controlled prefetching done on behalf of the whole system, for
operations like recovery. Therefore it makes sense for it to have a
value equal to or greater than effective_io_concurrency, which controls
I/O concurrency for reading a relation in a bitmap heap scan.
ff79b5b2ab increased effective_io_concurrency to 16, so we'll increase
maintenance_io_concurrency as well. For now, though, we'll keep the
defaults of effective_io_concurrency and maintenance_io_concurrency
equal to one another (16).
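For reference, a quick way to confirm the values on a running server (both report 16 on master after this change; raising them further is a per-system tuning decision):

    SHOW effective_io_concurrency;       -- 16 by default after ff79b5b2ab
    SHOW maintenance_io_concurrency;     -- 16 by default after this commit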
On fast, high IOPs systems, significantly higher values of
maintenance_io_concurrency are observably beneficial [1]. However, such
values would flood low IOPs systems and increase overall system I/O
latency.
It is worth mentioning that since 9256822608f and c3e775e608f,
maintenance_io_concurrency also controls the I/O concurrency of each
vacuum worker. Since many autovacuum workers may be simultaneously
issuing I/Os, we want to keep maintenance_io_concurrency appropriately
conservative.
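As a rough illustration of why (this is only an upper-bound sketch, assuming the stock autovacuum_max_workers of 3, and each worker issues at most maintenance_io_concurrency I/Os at a time):

    -- approximate upper bound on autovacuum-driven prefetch I/Os in flight
    SELECT current_setting('autovacuum_max_workers')::int
         * current_setting('maintenance_io_concurrency')::int
           AS approx_in_flight;   -- 3 * 16 = 48, before any manual VACUUMs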
[1]: /messages/by-id/c5d52837-6256-0556-ac8c-d6d3d558820a@enterprisedb.com
Suggested-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Discussion: /messages/by-id/CAKZiRmxdHQaU+2Zpe6d=x=0vigJ1sfWwwVYLJAf=ud_wQ_VcUw@mail.gmail.com
Branch
------
master
Details
-------
https://git.postgresql.org/pg/commitdiff/cc6be07ebde2aef7cc0507f997f563ce77c00a00
Modified Files
--------------
doc/src/sgml/config.sgml | 8 ++++----
src/backend/utils/misc/postgresql.conf.sample | 2 +-
src/include/storage/bufmgr.h | 2 +-
3 files changed, 6 insertions(+), 6 deletions(-)
This commit makes our default random_page_cost = 4 out of line with
these new settings (assumes modern SSD/NAS/SAN hardware) and more out of
line with reality.
---------------------------------------------------------------------------
On Tue, Mar 18, 2025 at 01:08:47PM +0000, Melanie Plageman wrote:
Increase default maintenance_io_concurrency to 16
Since its introduction in fc34b0d9de27a, the default
maintenance_io_concurrency has been larger than the default
effective_io_concurrency. maintenance_io_concurrency primarily
controlled prefetching done on behalf of the whole system, for
operations like recovery. Therefore it makes sense for it to have a
value equal to or greater than effective_io_concurrency, which controls
I/O concurrency for reading a relation in a bitmap heap scan.
ff79b5b2ab increased effective_io_concurrency to 16, so we'll increase
maintenance_io_concurrency as well. For now, though, we'll keep the
defaults of effective_io_concurrency and maintenance_io_concurrency
equal to one another (16).
On fast, high IOPs systems, significantly higher values of
maintenance_io_concurrency are observably beneficial [1]. However, such
values would flood low IOPs systems and increase overall system I/O
latency.
It is worth mentioning that since 9256822608f and c3e775e608f,
maintenance_io_concurrency also controls the I/O concurrency of each
vacuum worker. Since many autovacuum workers may be simultaneously
issuing I/Os, we want to keep maintenance_io_concurrency appropriately
conservative.
[1]: /messages/by-id/c5d52837-6256-0556-ac8c-d6d3d558820a@enterprisedb.com
Suggested-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Discussion: /messages/by-id/CAKZiRmxdHQaU+2Zpe6d=x=0vigJ1sfWwwVYLJAf=ud_wQ_VcUw@mail.gmail.com
Branch
------
master
Details
-------
https://git.postgresql.org/pg/commitdiff/cc6be07ebde2aef7cc0507f997f563ce77c00a00
Modified Files
--------------
doc/src/sgml/config.sgml | 8 ++++----
src/backend/utils/misc/postgresql.conf.sample | 2 +-
src/include/storage/bufmgr.h | 2 +-
3 files changed, 6 insertions(+), 6 deletions(-)
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Do not let urgent matters crowd out time for investment in the future.
Hi,
On 2025-03-18 16:08:22 -0400, Bruce Momjian wrote:
This commit makes our default random_page_cost = 4 out of line with
these new settings (assumes modern SSD/NAS/SAN hardware) and more out of
line with reality.
How so? That seems like an independent consideration to me.
Greetings,
Andres Freund
On Tue, Mar 18, 2025 at 04:13:26PM -0400, Andres Freund wrote:
Hi,
On 2025-03-18 16:08:22 -0400, Bruce Momjian wrote:
This commit makes our default random_page_cost = 4 out of line with
these new settings (assumes modern SSD/NAS/SAN hardware) and more out of
line with reality.
How so? That seems like an independent consideration to me.
[thread moved to hackers]
Uh, I think our old random_page_cost and *_io_concurrency assumed
magnetic disks --- now *_io_concurrency assumes more modern hardware and
random_page_cost assumes magnetic.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Do not let urgent matters crowd out time for investment in the future.
Hi,
On 2025-03-18 16:22:45 -0400, Bruce Momjian wrote:
On Tue, Mar 18, 2025 at 04:13:26PM -0400, Andres Freund wrote:
Hi,
On 2025-03-18 16:08:22 -0400, Bruce Momjian wrote:
This commit makes our default random_page_cost = 4 out of line with
these new settings (assumes modern SSD/NAS/SAN hardware) and more out of
line with reality.
How so? That seems like an independent consideration to me.
[thread moved to hackers]
Uh, I think our old random_page_cost and *_io_concurrency assumed
magnetic disks --- now *_io_concurrency assumes more modern hardware and
random_page_cost assumes magnetic.
The cost difference between random and non-random IO is actually still
reasonably accurate with NVMEs. You can argue that random_page_cost should be
2.5, but that really depends on the specific hardware.
Particularly for cloud style networked storage, you could even argue that the
difference between sequential and random IO has *grown* given recent changes in
PG (io combining in PG 17), as random IOs much more quickly lead to exhausting
IOPS quotas.
I still don't think adjusting random_page_cost has any meaningful relation to
the change at hand.
Greetings,
Andres Freund
On Tue, Mar 18, 2025 at 04:27:18PM -0400, Andres Freund wrote:
Hi,
On 2025-03-18 16:22:45 -0400, Bruce Momjian wrote:
On Tue, Mar 18, 2025 at 04:13:26PM -0400, Andres Freund wrote:
Hi,
On 2025-03-18 16:08:22 -0400, Bruce Momjian wrote:
This commit makes our default random_page_cost = 4 out of line with
these new settings (assumes modern SSD/NAS/SAN hardware) and more out of
line with reality.
How so? That seems like an independent consideration to me.
[thread moved to hackers]
Uh, I think our old random_page_cost and *_io_concurrency assumed
magnetic disks --- now *_io_concurrency assumes more modern hardware and
random_page_cost assumes magnetic.
The cost difference between random and non-random IO is actually still
reasonably accurate with NVMEs. You can argue that random_page_cost should be
2.5, but that really depends on the specific hardware.
Uh, the random_page_cost = 4 assumes caching, so it is assuming actual
random I/O to be 40x slower, which I doubt is true for SSDs:
https://www.postgresql.org/docs/current/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS
Random access to mechanical disk storage is normally much more expensive
than four times sequential access. However, a lower default is used
(4.0) because the majority of random accesses to disk, such as indexed
reads, are assumed to be in cache. The default value can be thought of
as modeling random access as 40 times slower than sequential, while
expecting 90% of random reads to be cached.
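Spelled out, and treating a cached random read as costing roughly nothing, that description amounts to:

    -- 90% of random reads are cache hits (~free), 10% pay the full 40x penalty
    SELECT 0.9 * 0 + 0.1 * 40 AS modeled_random_page_cost;   -- 4.0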
Particularly for cloud style networked storage, you could even argue that the
difference between sequential and random IO has *grown* given recent changes in
PG (io combining in PG 17), as random IOs much more quickly lead to exhausting
IOPS quotas.
I still don't think adjusting random_page_cost has any meaningful relation to
the change at hand.
Okay, I was just asking.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Do not let urgent matters crowd out time for investment in the future.
Hi,
On 2025-03-18 16:35:29 -0400, Bruce Momjian wrote:
Uh, the random_page_cost = 4 assumes caching, so it is assuming actual
random I/O to be 40x slower, which I doubt is true for SSDs:
Uh, huh:
https://www.postgresql.org/docs/current/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS
Random access to mechanical disk storage is normally much more expensive
than four times sequential access. However, a lower default is used
(4.0) because the majority of random accesses to disk, such as indexed
reads, are assumed to be in cache. The default value can be thought of
as modeling random access as 40 times slower than sequential, while
expecting 90% of random reads to be cached.
Is that actually a good description of what we assume? I don't know where that
90% is coming from? Briefly skimming through selfuncs.c and costsize.c I don't
see anything.
The relevant change:
commit c1d9df4fa227781b31be44a5a3024865a7f48049
Author: Bruce Momjian <bruce@momjian.us>
Date: 2012-02-14 16:54:54 -0500
Document random page cost is only 4x seqeuntial, and not 40x.
The relevant discussion seems to be:
/messages/by-id/4F31A05A.1060506@2ndQuadrant.com
But I don't see any origin of that number in that thread.
I am not sure if I found the correct email for Greg Smith?
Greetings,
Andres Freund
On Tue, Mar 18, 2025 at 05:04:46PM -0400, Andres Freund wrote:
Hi,
On 2025-03-18 16:35:29 -0400, Bruce Momjian wrote:
Uh, the random_page_cost = 4 assumes caching, so it is assuming actual
random I/O to be 40x slower, which I doubt is true for SSDs:
Uh, huh:
https://www.postgresql.org/docs/current/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS
Random access to mechanical disk storage is normally much more expensive
than four times sequential access. However, a lower default is used
(4.0) because the majority of random accesses to disk, such as indexed
reads, are assumed to be in cache. The default value can be thought of
as modeling random access as 40 times slower than sequential, while
expecting 90% of random reads to be cached.
Is that actually a good description of what we assume? I don't know where that
90% is coming from? Briefly skimming through selfuncs.c and costsize.c I don't
see anything.
The next paragraph is:
If you believe a 90% cache rate is an incorrect assumption
for your workload, you can increase random_page_cost to better
reflect the true cost of random storage reads. Correspondingly,
if your data is likely to be completely in cache, such as when
the database is smaller than the total server memory, decreasing
random_page_cost can be appropriate. Storage that has a low random
read cost relative to sequential, e.g., solid-state drives, might
also be better modeled with a lower value for random_page_cost,
e.g., 1.1.
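Taken literally, applying that advice might look like the following sketch (the 1.1 value is the documentation's own example, not a recommendation from this thread, and the tablespace name is made up for illustration):

    ALTER SYSTEM SET random_page_cost = 1.1;
    SELECT pg_reload_conf();
    -- or scope it to a tablespace backed by fast storage:
    ALTER TABLESPACE fast_ssd SET (random_page_cost = 1.1);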
The relevant change:
commit c1d9df4fa227781b31be44a5a3024865a7f48049
Author: Bruce Momjian <bruce@momjian.us>
Date: 2012-02-14 16:54:54 -0500
Document random page cost is only 4x seqeuntial, and not 40x.
The relevant discussion seems to be:
/messages/by-id/4F31A05A.1060506@2ndQuadrant.com
But I don't see any origin of that number in that thread.
I am not sure if I found the correct email for Greg Smith?
Yes, I can't say there is much research behind the value, and even if
there was, the assumed hardware is unlikely to be relevant today.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Do not let urgent matters crowd out time for investment in the future.
Bruce Momjian <bruce@momjian.us> writes:
https://www.postgresql.org/docs/current/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS
Random access to mechanical disk storage is normally much more expensive
than four times sequential access. However, a lower default is used
(4.0) because the majority of random accesses to disk, such as indexed
reads, are assumed to be in cache. The default value can be thought of
as modeling random access as 40 times slower than sequential, while
expecting 90% of random reads to be cached.
This explanation is, in fact, made up out of whole cloth. It has
diddly-squat to do with where the number came from, although maybe
it's an okay rationalization for continuing to use 4.0 on modern
hardware.
Where the number came from is that I did a bunch of testing back in
the late 90s and random_page_cost = 4.0 made the planner's ratios of
seqscan vs indexscan costs match up with observed timings. Of course,
those observations were made with spinning-rust drives, so you'd not
get the same results on SSDs.
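A crude way to repeat that kind of comparison on current hardware might look like the sketch below; the table "t", its index, and the predicate are placeholders, and the interesting part is whether the ratio of estimated costs still tracks the ratio of observed times:

    SET enable_indexscan = off; SET enable_bitmapscan = off;
    EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t WHERE id BETWEEN 1000 AND 2000;
    RESET enable_indexscan; RESET enable_bitmapscan;
    SET enable_seqscan = off;
    EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t WHERE id BETWEEN 1000 AND 2000;
    RESET enable_seqscan;

If the estimated cost ratio drifts far from the timing ratio, random_page_cost is the usual knob to revisit.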
Yes, I can't say there is much research behind the value, and even if
there was, the assumed hardware is unlikely to be relevant today.
Yes as to the latter, no as to the former.
regards, tom lane
On Tue, Mar 18, 2025 at 5:04 PM Andres Freund <andres@anarazel.de> wrote:
Is that actually a good description of what we assume? I don't know where that
90% is coming from?
That one's all my fault. It was an attempt to curve-fit backwards why the
4.0 number Tom set with his initial commit worked as well as it did given
that underlying storage was closer to 50X as slow, and I sold the idea well
enough for Bruce to follow the reasoning and commit it. Back then there
was a regular procession of people who measured the actual rate and
wondered why there was the order of magnitude difference between those
measurements and the parameter. Pointing them toward thinking in terms of
the cached read percentage too did a reasonable job of deflecting them onto
why the model was more complicated than it seems. I intended to follow
that up with more measurements, only to lose the whole project into a
non-disclosure void I have only recently escaped.
I agree with your observation that the underlying cost of a non-sequential
read stall on cloud storage is not markedly better than the original
random: sequential ratio of mechanical drives. And the PG17 refactoring
to improve I/O chunking worked to magnify that further.
The end of this problem I'm working on again is assembling some useful mix
of workloads such that I can try changing one of these magic constants with
higher confidence. My main working set so far is write performance
regression test sets against the Open Street Map loading workload that
I've been blogging about, plus the old read-only queries of the SELECT-only
test spaced along a scale/client grid. My experiments so far have been around
another Tom special, the maximum buffer usage count limit, which turned
into another black hole full of work I have only recently escaped. I
haven't really thought much yet about a workload set that would allow
adjusting random_page_cost. On the query side we've been pretty heads down
on the TPC-H and Clickbench sets. I don't have buffer internals data from
those yet though, will have to add that to the work queue.
--
Greg Smith
Director of Open Source Strategy, Crunchy Data
greg.smith@crunchydata.com