effective_io_concurrency's steampunk spindle maths
Hello,
I was reading through some old threads[1][2][3] while trying to figure
out how to add a new GUC to control I/O prefetching for new kinds of
things[4][5], and enjoyed Simon Riggs' reference to Jules Verne in the
context of RAID spindles.
On 2 Sep 2015 14:54, "Andres Freund" <andres(at)anarazel(dot)de> wrote:
On 2015-09-02 18:06:54 +0200, Tomas Vondra wrote:
Maybe the best thing we can do is just completely abandon the "number of
spindles" idea, and just say "number of I/O requests to prefetch". Possibly
with an explanation of how to estimate it (devices * queue length).
I think that'd be a lot better.
+many, though I doubt I could describe how to estimate it myself,
considering cloud storage, SANs, multi-lane NVMe etc. You basically
have to experiment, and like most of our resource consumption limits,
it's a per-backend limit anyway, so it's pretty complicated, but I
don't see how the harmonic series helps anyone.
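For concreteness, here's a sketch (with an illustrative value only) of what
the old "spindles" setting actually translated to, since the current code
expands the GUC through the coupon-collector formula n * H(n) rather than
using it directly as a prefetch distance:
select round(sum(8 / n::float)) as prefetch_distance
from generate_series(1, 8) s(n);
-- a setting of 8 "spindles" means roughly 22 prefetches kept in flight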
Should we rename it? Here are my first suggestions:
random_page_prefetch_degree
maintenance_random_page_prefetch_degree
Rationale for this naming pattern:
* "random_page" from "random_page_cost"
* leaves room for a different setting for sequential prefetching
* "degree" conveys the idea without using loaded words like "queue"
that might imply we know something about the I/O subsystem or that
it's system-wide like kernel and device queues
* "maintenance_" prefix is like other GUCs that establish (presumably
larger) limits for processes working on behalf of many user sessions
Whatever we call it, I don't think it makes sense to try to model the
details of any particular storage system. Let's use a simple counter
of I/Os initiated but not yet known to have completed (for now: it has
definitely completed when the associated pread() completes; perhaps
something involving real async I/O completion notification in later
releases).
[1]: /messages/by-id/CAHyXU0yaUG9R_E5=1gdXhD-MpWR=Gr=4=EHFD_fRid2+SCQrqA@mail.gmail.com
[2]: /messages/by-id/Pine.GSO.4.64.0809220317320.20434@westnet.com
[3]: /messages/by-id/FDDBA24E-FF4D-4654-BA75-692B3BA71B97@enterprisedb.com
[4]: /messages/by-id/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
[5]: /messages/by-id/CA+TgmoZP-CTmEPZdmqEOb+6t_Tts2nuF7eoqxxvXEHaUoBDmsw@mail.gmail.com
Hi,
On 2020-03-02 18:28:41 +1300, Thomas Munro wrote:
I was reading through some old threads[1][2][3] while trying to figure
out how to add a new GUC to control I/O prefetching for new kinds of
things[4][5], and enjoyed Simon Riggs' reference to Jules Verne in the
context of RAID spindles.
On 2 Sep 2015 14:54, "Andres Freund" <andres(at)anarazel(dot)de> wrote:
On 2015-09-02 18:06:54 +0200, Tomas Vondra wrote:
Maybe the best thing we can do is just completely abandon the "number of
spindles" idea, and just say "number of I/O requests to prefetch". Possibly
with an explanation of how to estimate it (devices * queue length).
I think that'd be a lot better.
+many, though I doubt I could describe how to estimate it myself,
considering cloud storage, SANs, multi-lane NVMe etc. You basically
have to experiment, and like most of our resource consumption limits,
it's a per-backend limit anyway, so it's pretty complicated, but I
don't see how the harmonic series helps anyone.
Should we rename it? Here are my first suggestions:
Why rename? It's not like anybody knew how to infer a useful value for
effective_io_concurrency, given the math computing the actually
effective prefetch distance... I feel like we'll just unnecessarily
cause people difficulty by doing so.
random_page_prefetch_degree
maintenance_random_page_prefetch_degree
I don't like these names.
Rationale for this naming pattern:
* "random_page" from "random_page_cost"
I don't think we want to corner ourselves into only ever using these for
random I/O.
* leaves room for a different setting for sequential prefetching
I think if we want to split those at some point, we ought to split it if
we have a good reason, not before. It's not at all clear to me why you'd
want a substantially different queue depth for both.
* "degree" conveys the idea without using loaded words like "queue"
that might imply we know something about the I/O subsystem or that
it's system-wide like kernel and device queues
Why is that good? Queue depth is a pretty well established term. You can
search for benchmarks of devices with it, you can correlate with OS
config, etc.
* "maintenance_" prefix is like other GUCs that establish (presumably
larger) limits for processes working on behalf of many user sessions
That part makes sense to me.
Whatever we call it, I don't think it makes sense to try to model the
details of any particular storage system. Let's use a simple counter
of I/Os initiated but not yet known to have completed (for now: it has
definitely completed when the associated pread() completes; perhaps
something involving real async I/O completion notification in later
releases).
+1
Greetings,
Andres Freund
On Fri, Mar 06, 2020 at 10:05:13AM -0800, Andres Freund wrote:
Hi,
On 2020-03-02 18:28:41 +1300, Thomas Munro wrote:
I was reading through some old threads[1][2][3] while trying to figure
out how to add a new GUC to control I/O prefetching for new kinds of
things[4][5], and enjoyed Simon Riggs' reference to Jules Verne in the
context of RAID spindles.
On 2 Sep 2015 14:54, "Andres Freund" <andres(at)anarazel(dot)de> wrote:
On 2015-09-02 18:06:54 +0200, Tomas Vondra wrote:
Maybe the best thing we can do is just completely abandon the "number of
spindles" idea, and just say "number of I/O requests to prefetch". Possibly
with an explanation of how to estimate it (devices * queue length).
I think that'd be a lot better.
+many, though I doubt I could describe how to estimate it myself,
considering cloud storage, SANs, multi-lane NVMe etc. You basically
have to experiment, and like most of our resource consumption limits,
it's a per-backend limit anyway, so it's pretty complicated, but I
don't see how the harmonic series helps anyone.
Should we rename it? Here are my first suggestions:
Why rename? It's not like anybody knew how to infer a useful value for
effective_io_concurrency, given the math computing the actually
effective prefetch distance... I feel like we'll just unnecessarily
cause people difficulty by doing so.
I think the main issue with keeping the current GUC name is that if you
had a value that worked, we'll silently interpret it differently. Which
is a bit annoying :-(
So I think we should either rename e_i_c or keep it as is, and then also
have a new GUC. And then translate the values between those (but that
might be overkill).
random_page_prefetch_degree
maintenance_random_page_prefetch_degree
I don't like these names.
What about these names?
* effective_io_prefetch_distance
* effective_io_prefetch_queue
* effective_io_queue_depth
Rationale for this naming pattern:
* "random_page" from "random_page_cost"I don't think we want to corner us into only ever using these for random
io.
+1
* leaves room for a different setting for sequential prefetching
I think if we want to split those at some point, we ought to split it if
we have a good reason, not before. It's not at all clear to me why you'd
want a substantially different queue depth for both.
+1
* "degree" conveys the idea without using loaded words like "queue"
that might imply we know something about the I/O subsystem or that
it's system-wide like kernel and device queues
Why is that good? Queue depth is a pretty well established term. You can
search for benchmarks of devices with it, you can correlate with OS
config, etc.
I mostly agree. With "queue depth" people have a fairly good idea what
they're setting, while with "degree" that's pretty unlikely I think.
* "maintenance_" prefix is like other GUCs that establish (presumably
larger) limits for processes working on behalf of many user sessions
That part makes sense to me.
Whatever we call it, I don't think it makes sense to try to model the
details of any particular storage system. Let's use a simple counter
of I/Os initiated but not yet known to have completed (for now: it has
definitely completed when the associated pread() completes; perhaps
something involving real async I/O completion notification in later
releases).
+1
Agreed.
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
So I think we should either rename e_i_c or keep it as is, and then also
have a new GUC. And then translate the values between those (but that
might be overkill).
Please DON'T try to have two interrelated GUCs for this. We learned
our lesson about that years ago.
I think dropping the existing GUC is a perfectly sane thing to do,
if the new definition wouldn't be compatible. In practice few
people will notice, because few will have set it.
regards, tom lane
Hi,
On Mon, Mar 02, 2020 at 06:28:41PM +1300, Thomas Munro wrote:
Should we rename it? Here are my first suggestions:
maintenance_random_page_prefetch_degree
Rationale for this naming pattern:
[...]
* "maintenance_" prefix is like other GUCs that establish (presumably
larger) limits for processes working on behalf of many user sessions
I'm a bit skeptical about this - at least in V12 there's only two GUCs
with 'maintenance' in the name: maintenance_work_mem and
max_parallel_maintenance_workers. Both are used for utility commands and
do not apply to regular user queries while (AFAICT) your proposal is not
limited to utility commands. So I think if you name it
'maintenance'-something, people will assume it only involves VACUUM or
so.
Michael
--
Michael Banck
Projektleiter / Senior Berater
On Sat, Mar 7, 2020 at 9:52 AM Michael Banck <michael.banck@credativ.de> wrote:
On Mon, Mar 02, 2020 at 06:28:41PM +1300, Thomas Munro wrote:
* "maintenance_" prefix is like other GUCs that establish (presumably
larger) limits for processes working on behalf of many user sessions
I'm a bit skeptical about this - at least in V12 there's only two GUCs
with 'maintenance' in the name: maintenance_work_mem and
max_parallel_maintenance_workers. Both are used for utility commands and
do not apply to regular user queries while (AFAICT) your proposal is not
limited to utility commands. So I think if you name it
'maintenance'-something, people will assume it only involves VACUUM or
so.
No, the proposal is not for the "maintenance" GUC to affect user
queries. The idea is that the "maintenance" GUC would be used for WAL
prefetching during recovery[1], index prefetch during VACUUM[2] and
probably some other proposed things that are in development relating
to background "undo" processing. What these things have in common, as
Andres first articulated on thread [2], is that they all deal with a
workload that is correlated with the activities of multiple user
backends running concurrently. That's the basic idea of the WAL
prefetching patch: even though all backends suffer from I/O stalls due
to cache misses, usually that's happening concurrently in many
backends. A streaming replica that is trying to follow along
replaying the write-workload of the primary has to suffer all those
stalls sequentially, so I'm trying to recreate some I/O parallelism by
looking ahead in the WAL. The theory with the two GUCs is that a user
backend should be able to use some I/O parallelism, but a
"maintenance" job like the WAL prefetcher should be allowed to use a
lot more. That's why the existing VACUUM code mentioned in thread [2]
already does "+ 10".
Maybe "maintenance" isn't the best word for this, but that's the idea.
[1]: /messages/by-id/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
[2]: /messages/by-id/CA+TgmoZP-CTmEPZdmqEOb+6t_Tts2nuF7eoqxxvXEHaUoBDmsw@mail.gmail.com
On Sat, Mar 7, 2020 at 8:00 AM Andres Freund <andres@anarazel.de> wrote:
On 2020-03-02 18:28:41 +1300, Thomas Munro wrote:
* leaves room for a different setting for sequential prefetching
I think if we want to split those at some point, we ought to split it if
we have a good reason, not before. It's not at all clear to me why you'd
want a substantially different queue depth for both.
Alright, I retract that part. It's known that at least on some
systems you might want to suppress that (due to some kind of bad
interaction with kernel readahead heuristics). But that isn't really
an argument for having a different queue size, it's an argument for
having a separate on/off switch.
* "degree" conveys the idea without using loaded words like "queue"
that might imply we know something about the I/O subsystem or that
it's system-wide like kernel and device queues
Why is that good? Queue depth is a pretty well established term. You can
search for benchmarks of devices with it, you can correlate with OS
config, etc.
Queue depth is the standard term for an I/O queue that is shared by
all users. What we're talking about here is undeniably also a queue
with a depth, but it's a limit on the amount of concurrent I/O that
*each operator in a query* will try to initiate (for example: each
bitmap heap scan in the query, in future perhaps btree scans and other
things), so I was thinking that we might want a different name.
The more I think about this the more I appreciate the current vague GUC name!
On Sat, Mar 7, 2020 at 8:35 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
I think the main issue with keeping the current GUC name is that if you
had a value that worked, we'll silently interpret it differently. Which
is a bit annoying :-(
Yeah. Perhaps we should just give the formula for translating v12
settings to v13 settings in the release notes. If we don't rename the
GUC, you won't be forced to contemplate this when you upgrade, so the
amount of prefetching we do will go down a bit given the same value.
That is indeed what led me to start thinking about what a good new
name would be. Now that I've been talked out of the "random_page"
part, your names look like sensible candidates, but I wonder if there
is some way to capture that it's "per operation"...
On Mar 7, 2020, at 00:33, Thomas Munro <thomas.munro@gmail.com> wrote:
That is indeed what led me to start thinking about what a good new
name would be.
MySQL has a term io_capacity.
https://dev.mysql.com/doc/refman/8.0/en/innodb-configuring-io-capacity.html
The innodb_io_capacity variable defines the overall I/O capacity available to InnoDB. It should be set to approximately the number of I/O operations that the system can perform per second (IOPS). When innodb_io_capacity is set, InnoDB estimates the I/O bandwidth available for background tasks based on the set value.
Perhaps we can have maintenance_io_capacity as well.
On Sat, Mar 7, 2020 at 11:54 PM Evgeniy Shishkin <itparanoia@gmail.com> wrote:
On Mar 7, 2020, at 00:33, Thomas Munro <thomas.munro@gmail.com> wrote:
That is indeed what led me to start thinking about what a good new
name would be.
MySQL has a term io_capacity.
https://dev.mysql.com/doc/refman/8.0/en/innodb-configuring-io-capacity.html
The innodb_io_capacity variable defines the overall I/O capacity available to InnoDB. It should be set to approximately the number of I/O operations that the system can perform per second (IOPS). When innodb_io_capacity is set, InnoDB estimates the I/O bandwidth available for background tasks based on the set value.
Perhaps we can have maintenance_io_capacity as well.
That sounds like total I/O capacity for your system that will be
shared out for various tasks, which would definitely be nice to have,
but here we're talking about a simpler per-operation setting. What
we have is a bit like work_mem (a memory limit used for each
individual hash, sort, tuplestore, ...), compared to a hypothetical
whole-system memory budget (which would definitely also be nice to
have).
On Sat, Mar 7, 2020 at 9:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
So I think we should either rename e_i_c or keep it as is, and then also
have a new GUC. And then translate the values between those (but that
might be overkill).
Please DON'T try to have two interrelated GUCs for this. We learned
our lesson about that years ago.
Ack.
I think dropping the existing GUC is a perfectly sane thing to do,
if the new definition wouldn't be compatible. In practice few
people will notice, because few will have set it.
That's what I thought too, but if Andres is right that "it's not like
anybody knew how to infer a useful value", I'm wondering if it's enough
to just provide an explanation of the change in the release notes.
The default doesn't change (1 goes to 1), so most people will
experience no change, but if you had it set to (say) 42 after careful
experimentation, you might like to consider updating it to the result
of:
select round(sum(42 / n::float)) as new_setting from
generate_series(1, 42) s(n)
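To give a rough feel for the translation, here's a sketch that tabulates the
same formula for a few illustrative old values (examples only, not
recommendations):
select old_setting, round(sum(old_setting / n::float)) as new_setting
from (values (1), (10), (42)) as v(old_setting),
     generate_series(1, old_setting) as s(n)
group by old_setting
order by old_setting;
-- roughly: 1 -> 1, 10 -> 29, 42 -> 182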
Here's a patch set to remove the spindle stuff, add a maintenance
variant, and use the maintenance one in
heap_compute_xid_horizon_for_tuples().
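For anyone who wants to try the patches out, here's a minimal sketch of how
the settings would be used (the tablespace name is made up):
-- per session, or in postgresql.conf
set effective_io_concurrency = 32;
set maintenance_io_concurrency = 64;
-- or per tablespace, overriding the GUCs for relations stored there
alter tablespace fast_ssd set (effective_io_concurrency = 64,
                               maintenance_io_concurrency = 128);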
Attachments:
0001-Simplify-the-effective_io_concurrency-setting.patch (text/x-patch)
From fc472c5b473e3495e92ec9cf021b6e6393579ca5 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 7 Mar 2020 15:04:34 +1300
Subject: [PATCH 1/2] Simplify the effective_io_concurrency setting.
The effective_io_concurrency GUC and tablespace option were previously not used
directly to control the number of prefetches we'd issue. Instead, they were
passed through a formula based on a theory about RAID spindles and
probabilities. Today, a lot of people are familiar with the concept of I/O
request queues and have never seen a spinning disk, so let's switch to
something a little easier to justify.
If users were using a setting higher than the default value 1, then they will
need to increase the value for the same effect. The query to compute the new
setting corresponding to the old setting OLD is:
select round(sum(OLD / n::float)) as new_setting
from generate_series(1, OLD) s(n);
Discussion: https://postgr.es/m/CA%2BhUKGJUw08dPs_3EUcdO6M90GnjofPYrWp4YSLaBkgYwS-AqA%40mail.gmail.com
---
src/backend/executor/nodeBitmapHeapscan.c | 18 +-----
src/backend/storage/buffer/bufmgr.c | 74 +++--------------------
src/backend/utils/misc/guc.c | 29 +--------
src/include/storage/bufmgr.h | 1 -
4 files changed, 13 insertions(+), 109 deletions(-)
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index ae8a11da30..726d3a2d9a 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -707,7 +707,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
{
BitmapHeapScanState *scanstate;
Relation currentRelation;
- int io_concurrency;
/* check for unsupported flags */
Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -737,8 +736,6 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
scanstate->prefetch_iterator = NULL;
scanstate->prefetch_pages = 0;
scanstate->prefetch_target = 0;
- /* may be updated below */
- scanstate->prefetch_maximum = target_prefetch_pages;
scanstate->pscan_len = 0;
scanstate->initialized = false;
scanstate->shared_tbmiterator = NULL;
@@ -794,20 +791,11 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
ExecInitQual(node->bitmapqualorig, (PlanState *) scanstate);
/*
- * Determine the maximum for prefetch_target. If the tablespace has a
- * specific IO concurrency set, use that to compute the corresponding
- * maximum value; otherwise, we already initialized to the value computed
- * by the GUC machinery.
+ * Maximum number of prefetches for the tablespace if configured, otherwise
+ * the current value of the effective_io_concurrency GUC.
*/
- io_concurrency =
+ scanstate->prefetch_maximum =
get_tablespace_io_concurrency(currentRelation->rd_rel->reltablespace);
- if (io_concurrency != effective_io_concurrency)
- {
- double maximum;
-
- if (ComputeIoConcurrency(io_concurrency, &maximum))
- scanstate->prefetch_maximum = rint(maximum);
- }
scanstate->ss.ss_currentRelation = currentRelation;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5880054245..7a7748b695 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -110,6 +110,13 @@ bool zero_damaged_pages = false;
int bgwriter_lru_maxpages = 100;
double bgwriter_lru_multiplier = 2.0;
bool track_io_timing = false;
+
+/*
+ * How many buffers PrefetchBuffer callers should try to stay ahead of their
+ * ReadBuffer calls by. Zero means "never prefetch". This value is only used
+ * for buffers not belonging to tablespaces that have their
+ * effective_io_concurrency parameter set.
+ */
int effective_io_concurrency = 0;
/*
@@ -120,15 +127,6 @@ int checkpoint_flush_after = 0;
int bgwriter_flush_after = 0;
int backend_flush_after = 0;
-/*
- * How many buffers PrefetchBuffer callers should try to stay ahead of their
- * ReadBuffer calls by. This is maintained by the assign hook for
- * effective_io_concurrency. Zero means "never prefetch". This value is
- * only used for buffers not belonging to tablespaces that have their
- * effective_io_concurrency parameter set.
- */
-int target_prefetch_pages = 0;
-
/* local state for StartBufferIO and related functions */
static BufferDesc *InProgressBuf = NULL;
static bool IsForInput;
@@ -461,64 +459,6 @@ static int ckpt_buforder_comparator(const void *pa, const void *pb);
static int ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
-/*
- * ComputeIoConcurrency -- get the number of pages to prefetch for a given
- * number of spindles.
- */
-bool
-ComputeIoConcurrency(int io_concurrency, double *target)
-{
- double new_prefetch_pages = 0.0;
- int i;
-
- /*
- * Make sure the io_concurrency value is within valid range; it may have
- * been forced with a manual pg_tablespace update.
- */
- io_concurrency = Min(Max(io_concurrency, 0), MAX_IO_CONCURRENCY);
-
- /*----------
- * The user-visible GUC parameter is the number of drives (spindles),
- * which we need to translate to a number-of-pages-to-prefetch target.
- * The target value is stashed in *extra and then assigned to the actual
- * variable by assign_effective_io_concurrency.
- *
- * The expected number of prefetch pages needed to keep N drives busy is:
- *
- * drives | I/O requests
- * -------+----------------
- * 1 | 1
- * 2 | 2/1 + 2/2 = 3
- * 3 | 3/1 + 3/2 + 3/3 = 5 1/2
- * 4 | 4/1 + 4/2 + 4/3 + 4/4 = 8 1/3
- * n | n * H(n)
- *
- * This is called the "coupon collector problem" and H(n) is called the
- * harmonic series. This could be approximated by n * ln(n), but for
- * reasonable numbers of drives we might as well just compute the series.
- *
- * Alternatively we could set the target to the number of pages necessary
- * so that the expected number of active spindles is some arbitrary
- * percentage of the total. This sounds the same but is actually slightly
- * different. The result ends up being ln(1-P)/ln((n-1)/n) where P is
- * that desired fraction.
- *
- * Experimental results show that both of these formulas aren't aggressive
- * enough, but we don't really have any better proposals.
- *
- * Note that if io_concurrency = 0 (disabled), we must set target = 0.
- *----------
- */
-
- for (i = 1; i <= io_concurrency; i++)
- new_prefetch_pages += (double) io_concurrency / (double) i;
-
- *target = new_prefetch_pages;
-
- /* This range check shouldn't fail, but let's be paranoid */
- return (new_prefetch_pages >= 0.0 && new_prefetch_pages < (double) INT_MAX);
-}
-
/*
* PrefetchBuffer -- initiate asynchronous read of a block of a relation
*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c1fad3b350..45dbf270bf 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -195,7 +195,6 @@ static bool check_autovacuum_max_workers(int *newval, void **extra, GucSource so
static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
-static void assign_effective_io_concurrency(int newval, void *extra);
static void assign_pgstat_temp_directory(const char *newval, void *extra);
static bool check_application_name(char **newval, void **extra, GucSource source);
static void assign_application_name(const char *newval, void *extra);
@@ -2881,7 +2880,7 @@ static struct config_int ConfigureNamesInt[] =
0,
#endif
0, MAX_IO_CONCURRENCY,
- check_effective_io_concurrency, assign_effective_io_concurrency, NULL
+ check_effective_io_concurrency, NULL, NULL
},
{
@@ -11456,36 +11455,14 @@ check_max_worker_processes(int *newval, void **extra, GucSource source)
static bool
check_effective_io_concurrency(int *newval, void **extra, GucSource source)
{
-#ifdef USE_PREFETCH
- double new_prefetch_pages;
-
- if (ComputeIoConcurrency(*newval, &new_prefetch_pages))
- {
- int *myextra = (int *) guc_malloc(ERROR, sizeof(int));
-
- *myextra = (int) rint(new_prefetch_pages);
- *extra = (void *) myextra;
-
- return true;
- }
- else
- return false;
-#else
+#ifndef USE_PREFETCH
if (*newval != 0)
{
GUC_check_errdetail("effective_io_concurrency must be set to 0 on platforms that lack posix_fadvise().");
return false;
}
- return true;
-#endif /* USE_PREFETCH */
-}
-
-static void
-assign_effective_io_concurrency(int newval, void *extra)
-{
-#ifdef USE_PREFETCH
- target_prefetch_pages = *((int *) extra);
#endif /* USE_PREFETCH */
+ return true;
}
static void
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 73c7e9ba38..0b082ea79f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -161,7 +161,6 @@ extern PGDLLIMPORT int32 *LocalRefCount;
/*
* prototypes for functions in bufmgr.c
*/
-extern bool ComputeIoConcurrency(int io_concurrency, double *target);
extern void PrefetchBuffer(Relation reln, ForkNumber forkNum,
BlockNumber blockNum);
extern Buffer ReadBuffer(Relation reln, BlockNumber blockNum);
--
2.20.1
0002-Introduce-a-maintenance_io_concurrency-setting.patch (text/x-patch)
From d7f04fbbea2d69530be9ea13b7dfb175d37c319c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 10 Mar 2020 11:49:40 +1300
Subject: [PATCH 2/2] Introduce a maintenance_io_concurrency setting.
Introduce a GUC and a tablespace option to control I/O prefetching, much
like effective_io_concurrency, but for maintenance tasks that act on
behalf of many regular user backends.
Use the new setting in heapam.c instead of the hard-coded formula
effective_io_concurrency + 10 introduced by commit 558a9165e08. Take
inspiration from that commit to choose a default value of 10 for
maintenance_io_concurrency.
Discussion: https://postgr.es/m/CA%2BhUKGJUw08dPs_3EUcdO6M90GnjofPYrWp4YSLaBkgYwS-AqA%40mail.gmail.com
---
doc/src/sgml/config.sgml | 16 +++++++++++++
doc/src/sgml/ref/alter_tablespace.sgml | 11 +++++----
doc/src/sgml/ref/create_tablespace.sgml | 11 +++++----
src/backend/access/common/reloptions.c | 16 ++++++++++++-
src/backend/access/heap/heapam.c | 16 +++----------
src/backend/storage/buffer/bufmgr.c | 6 +++++
src/backend/utils/cache/spccache.c | 14 +++++++++++
src/backend/utils/misc/guc.c | 32 +++++++++++++++++++++++++
src/bin/psql/tab-complete.c | 2 +-
src/include/commands/tablespace.h | 1 +
src/include/storage/bufmgr.h | 1 +
src/include/utils/spccache.h | 1 +
12 files changed, 104 insertions(+), 23 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c1128f89ec..52bb3f749f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2229,6 +2229,22 @@ include_dir 'conf.d'
</listitem>
</varlistentry>
+ <varlistentry id="guc-maintenance-io-concurrency" xreflabel="maintenance_io_concurrency">
+ <term><varname>maintenance_io_concurrency</varname> (<type>integer</type>)
+ <indexterm>
+ <primary><varname>maintenance_io_concurrency</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Similar to <varname>effective_io_concurrency</varname>, but used
+ for maintenance tasks such as VACUUM. The default value is higher,
+ because maintenance processes work on behalf of many concurrent
+ regular processes.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-max-worker-processes" xreflabel="max_worker_processes">
<term><varname>max_worker_processes</varname> (<type>integer</type>)
<indexterm>
diff --git a/doc/src/sgml/ref/alter_tablespace.sgml b/doc/src/sgml/ref/alter_tablespace.sgml
index acec33469f..356fb9f93f 100644
--- a/doc/src/sgml/ref/alter_tablespace.sgml
+++ b/doc/src/sgml/ref/alter_tablespace.sgml
@@ -84,13 +84,16 @@ ALTER TABLESPACE <replaceable>name</replaceable> RESET ( <replaceable class="par
<para>
A tablespace parameter to be set or reset. Currently, the only
available parameters are <varname>seq_page_cost</varname>,
- <varname>random_page_cost</varname> and <varname>effective_io_concurrency</varname>.
- Setting either value for a particular tablespace will override the
+ <varname>random_page_cost</varname>, <varname>effective_io_concurrency</varname>
+ and <varname>maintenance_io_concurrency</varname>.
+ Setting these values for a particular tablespace will override the
planner's usual estimate of the cost of reading pages from tables in
- that tablespace, as established by the configuration parameters of the
+ that tablespace, and the executor's prefetching behavior, as established
+ by the configuration parameters of the
same name (see <xref linkend="guc-seq-page-cost"/>,
<xref linkend="guc-random-page-cost"/>,
- <xref linkend="guc-effective-io-concurrency"/>). This may be useful if
+ <xref linkend="guc-effective-io-concurrency"/>,
+ <xref linkend="guc-maintenance-io-concurrency"/>). This may be useful if
one tablespace is located on a disk which is faster or slower than the
remainder of the I/O subsystem.
</para>
diff --git a/doc/src/sgml/ref/create_tablespace.sgml b/doc/src/sgml/ref/create_tablespace.sgml
index c621ec2c6b..462b8831c2 100644
--- a/doc/src/sgml/ref/create_tablespace.sgml
+++ b/doc/src/sgml/ref/create_tablespace.sgml
@@ -106,13 +106,16 @@ CREATE TABLESPACE <replaceable class="parameter">tablespace_name</replaceable>
<para>
A tablespace parameter to be set or reset. Currently, the only
available parameters are <varname>seq_page_cost</varname>,
- <varname>random_page_cost</varname> and <varname>effective_io_concurrency</varname>.
- Setting either value for a particular tablespace will override the
+ <varname>random_page_cost</varname>, <varname>effective_io_concurrency</varname>
+ and <varname>maintenance_io_concurrency</varname>.
+ Setting these values for a particular tablespace will override the
planner's usual estimate of the cost of reading pages from tables in
- that tablespace, as established by the configuration parameters of the
+ that tablespace, and the executor's prefetching behavior, as established
+ by the configuration parameters of the
same name (see <xref linkend="guc-seq-page-cost"/>,
<xref linkend="guc-random-page-cost"/>,
- <xref linkend="guc-effective-io-concurrency"/>). This may be useful if
+ <xref linkend="guc-effective-io-concurrency"/>,
+ <xref linkend="guc-maintenance-io-concurrency"/>). This may be useful if
one tablespace is located on a disk which is faster or slower than the
remainder of the I/O subsystem.
</para>
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index c3d45c7a24..654ed7993a 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -349,6 +349,19 @@ static relopt_int intRelOpts[] =
-1, 0, MAX_IO_CONCURRENCY
#else
0, 0, 0
+#endif
+ },
+ {
+ {
+ "maintenance_io_concurrency",
+ "Number of simultaneous requests that can be handled efficiently by the disk subsystem for maintenance tasks.",
+ RELOPT_KIND_TABLESPACE,
+ ShareUpdateExclusiveLock
+ },
+#ifdef USE_PREFETCH
+ -1, 0, MAX_IO_CONCURRENCY
+#else
+ 0, 0, 0
#endif
},
{
@@ -1700,7 +1713,8 @@ tablespace_reloptions(Datum reloptions, bool validate)
static const relopt_parse_elt tab[] = {
{"random_page_cost", RELOPT_TYPE_REAL, offsetof(TableSpaceOpts, random_page_cost)},
{"seq_page_cost", RELOPT_TYPE_REAL, offsetof(TableSpaceOpts, seq_page_cost)},
- {"effective_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, effective_io_concurrency)}
+ {"effective_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, effective_io_concurrency)},
+ {"maintenance_io_concurrency", RELOPT_TYPE_INT, offsetof(TableSpaceOpts, maintenance_io_concurrency)}
};
return (bytea *) build_reloptions(reloptions, validate,
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5a32e62ed0..29694b8aa4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7003,7 +7003,6 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
Page hpage;
#ifdef USE_PREFETCH
XidHorizonPrefetchState prefetch_state;
- int io_concurrency;
int prefetch_distance;
#endif
@@ -7026,24 +7025,15 @@ heap_compute_xid_horizon_for_tuples(Relation rel,
/*
* Compute the prefetch distance that we will attempt to maintain.
*
- * We don't use the regular formula to determine how much to prefetch
- * here, but instead just add a constant to effective_io_concurrency.
- * That's because it seems best to do some prefetching here even when
- * effective_io_concurrency is set to 0, but if the DBA thinks it's OK to
- * do more prefetching for other operations, then it's probably OK to do
- * more prefetching in this case, too. It may be that this formula is too
- * simplistic, but at the moment there is no evidence of that or any idea
- * about what would work better.
- *
* Since the caller holds a buffer lock somewhere in rel, we'd better make
* sure that isn't a catalog relation before we call code that does
* syscache lookups, to avoid risk of deadlock.
*/
if (IsCatalogRelation(rel))
- io_concurrency = effective_io_concurrency;
+ prefetch_distance = maintenance_io_concurrency;
else
- io_concurrency = get_tablespace_io_concurrency(rel->rd_rel->reltablespace);
- prefetch_distance = Min((io_concurrency) + 10, MAX_IO_CONCURRENCY);
+ prefetch_distance =
+ get_tablespace_maintenance_io_concurrency(rel->rd_rel->reltablespace);
/* Start prefetching. */
xid_horizon_prefetch_buffer(rel, &prefetch_state, prefetch_distance);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7a7748b695..7ab3577ad4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -119,6 +119,12 @@ bool track_io_timing = false;
*/
int effective_io_concurrency = 0;
+/*
+ * Like effective_io_concurrency, but used by maintenance code paths that might
+ * benefit from a higher setting because they work on behalf of many backends.
+ */
+int maintenance_io_concurrency = 0;
+
/*
* GUC variables about triggering kernel writeback for buffers written; OS
* dependent defaults are set via the GUC mechanism.
diff --git a/src/backend/utils/cache/spccache.c b/src/backend/utils/cache/spccache.c
index c4a0f719fb..e0c3c1b1c1 100644
--- a/src/backend/utils/cache/spccache.c
+++ b/src/backend/utils/cache/spccache.c
@@ -221,3 +221,17 @@ get_tablespace_io_concurrency(Oid spcid)
else
return spc->opts->effective_io_concurrency;
}
+
+/*
+ * get_tablespace_maintenance_io_concurrency
+ */
+int
+get_tablespace_maintenance_io_concurrency(Oid spcid)
+{
+ TableSpaceCacheEntry *spc = get_tablespace(spcid);
+
+ if (!spc->opts || spc->opts->maintenance_io_concurrency < 0)
+ return maintenance_io_concurrency;
+ else
+ return spc->opts->maintenance_io_concurrency;
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 45dbf270bf..668870f30f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -195,6 +195,7 @@ static bool check_autovacuum_max_workers(int *newval, void **extra, GucSource so
static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
+static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
static void assign_pgstat_temp_directory(const char *newval, void *extra);
static bool check_application_name(char **newval, void **extra, GucSource source);
static void assign_application_name(const char *newval, void *extra);
@@ -2883,6 +2884,24 @@ static struct config_int ConfigureNamesInt[] =
check_effective_io_concurrency, NULL, NULL
},
+ {
+ {"maintenance_io_concurrency",
+ PGC_USERSET,
+ RESOURCES_ASYNCHRONOUS,
+ gettext_noop("Like effective_io_concurrency, but a separate setting used by maintenance tasks."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &maintenance_io_concurrency,
+#ifdef USE_PREFETCH
+ 10,
+#else
+ 0,
+#endif
+ 0, MAX_IO_CONCURRENCY,
+ check_maintenance_io_concurrency, NULL, NULL
+ },
+
{
{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
@@ -11465,6 +11484,19 @@ check_effective_io_concurrency(int *newval, void **extra, GucSource source)
return true;
}
+static bool
+check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
+{
+#ifndef USE_PREFETCH
+ if (*newval != 0)
+ {
+ GUC_check_errdetail("maintenance_io_concurrency must be set to 0 on platforms that lack posix_fadvise().");
+ return false;
+ }
+#endif /* USE_PREFETCH */
+ return true;
+}
+
static void
assign_pgstat_temp_directory(const char *newval, void *extra)
{
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 174c3db623..ae35fa4aa9 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2140,7 +2140,7 @@ psql_completion(const char *text, int start, int end)
/* ALTER TABLESPACE <foo> SET|RESET ( */
else if (Matches("ALTER", "TABLESPACE", MatchAny, "SET|RESET", "("))
COMPLETE_WITH("seq_page_cost", "random_page_cost",
- "effective_io_concurrency");
+ "effective_io_concurrency", "maintenance_io_concurrency");
/* ALTER TEXT SEARCH */
else if (Matches("ALTER", "TEXT", "SEARCH"))
diff --git a/src/include/commands/tablespace.h b/src/include/commands/tablespace.h
index 41c457052d..fd1b28fca2 100644
--- a/src/include/commands/tablespace.h
+++ b/src/include/commands/tablespace.h
@@ -40,6 +40,7 @@ typedef struct TableSpaceOpts
float8 random_page_cost;
float8 seq_page_cost;
int effective_io_concurrency;
+ int maintenance_io_concurrency;
} TableSpaceOpts;
extern Oid CreateTableSpace(CreateTableSpaceStmt *stmt);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 0b082ea79f..428d8005e7 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -68,6 +68,7 @@ extern PGDLLIMPORT char *BufferBlocks;
/* in guc.c */
extern int effective_io_concurrency;
+extern int maintenance_io_concurrency;
/* in localbuf.c */
extern PGDLLIMPORT int NLocBuffer;
diff --git a/src/include/utils/spccache.h b/src/include/utils/spccache.h
index 5112ba3c37..7e4ec69aa2 100644
--- a/src/include/utils/spccache.h
+++ b/src/include/utils/spccache.h
@@ -16,5 +16,6 @@
void get_tablespace_page_costs(Oid spcid, float8 *spc_random_page_cost,
float8 *spc_seq_page_cost);
int get_tablespace_io_concurrency(Oid spcid);
+int get_tablespace_maintenance_io_concurrency(Oid spcid);
#endif /* SPCCACHE_H */
--
2.20.1
On Tue, Mar 10, 2020 at 12:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Here's a patch set to remove the spindle stuff, add a maintenance
variant, and use the maintenance one in
heap_compute_xid_horizon_for_tuples().
Pushed.
On Sun, Mar 15, 2020 at 9:27 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Tue, Mar 10, 2020 at 12:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Here's a patch set to remove the spindle stuff, add a maintenance
variant, and use the maintenance one in
heap_compute_xid_horizon_for_tuples().
Pushed.
Shouldn't you close out the "Should we rename
effective_io_concurrency?" Postgres 13 open item now?
--
Peter Geoghegan
On Wed, May 13, 2020 at 6:58 AM Peter Geoghegan <pg@bowt.ie> wrote:
Shouldn't you close out the "Should we rename
effective_io_concurrency?" Postgres 13 open item now?
Yeah, that doesn't really seem worth the churn. I'll move it to the
resolved list in a day or two if no one shows up to argue for a
rename.