Changing the default random_page_cost value
tl;dr let's assume SSDs are popular and HDDs are the exception and flip our
default
As I write this email, it's the year 2024. I think it is time we lower our
"default" setting of random_page_cost (as set in postgresql.conf.sample and
the docs). Even a decade ago, the current default of 4 was considered
fairly conservative and often lowered. The git log shows that this value
was last touched in 2006, during the age of spinning metal. We are now in a
new era, the age of SSDs, and thus we should lower this default value to
reflect the fact that the vast majority of people using Postgres these days
are doing so on solid state drives. We tend to stay ultra-conservative in
all of our settings, but we also need to recognize when there has been a
major shift in the underlying hardware, and in the calculations that our
defaults are based on.
Granted, there are other factors involved, and yes, perhaps we should tweak
some of the similar settings as well, but random_page_cost is the one
setting most out of sync with today's hardware realities. So I'll be brave
and throw a number out there: 1.2. And change our docs to say something like
"if you are using an older hard disk drive technology, you may want to try
raising rpc" to replace our fairly-hidden note about SSDs buried in the
last sentence - of the fourth paragraph - of the rpc docs.
Real data about performance on today's SSDs are welcome, and/or some way to
generate a more accurate default.
Cheers,
Greg
On Fri, Sep 27, 2024 at 8:07 AM Greg Sabino Mullane <htamfids@gmail.com>
wrote:
tl;dr let's assume SSDs are popular and HDDs are the exception and flip
our default
<snip>
Granted, there are other factors involved, and yes, perhaps we should
tweak some of the similar settings as well, but random_page_cost is the one
setting most out of sync with today's hardware realities. So I'll be brave
and throw a number out there: 1.2. And change our docs to say something like
"if you are using an older hard disk drive technology, you may want to try
raising rpc" to replace our fairly-hidden note about SSDs buried in the
last sentence - of the fourth paragraph - of the rpc docs.
+1
I suggest a slightly nicer comment in the default conf file, like "For
spinning hard drives, raise this to at least 3 and test"
Roberto
On Fri, 2024-09-27 at 10:07 -0400, Greg Sabino Mullane wrote:
So I'll be brave and throw a number out there: 1.2.
+1
Laurenz Albe
Greg Sabino Mullane <htamfids@gmail.com> writes:
So I'll be brave and throw a number out there: 1.2. And change our
docs to say something like "if you are using an older hard disk drive
technology, you may want to try raising rpc" to replace our
fairly-hidden note about SSDs buried in the last sentence - of the
fourth paragraph - of the rpc docs.
It might also be worth mentioning cloudy block storage (e.g. AWS' EBS),
which is typically backed by SSDs, but has extra network latency.
- ilmari
On Fri, Sep 27, 2024 at 12:03 PM Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
wrote:
It might also be worth mentioning cloudy block storage (e.g. AWS' EBS),
which is typically backed by SSDs, but has extra network latency.
That seems a little too in the weeds for me, but wording suggestions are
welcome. To get things moving forward, I made a doc patch which changes a
few things, namely:
* Mentions the distinction between ssd and hdd right up front.
* Moves the tablespace talk to the very end, as tablespace use is getting
rarer (again, thanks in part to ssds)
* Mentions the capability to set per-database and per-role since we mention
per-tablespace.
* Removes a lot of the talk of caches and justifications for the 4.0
setting. While those are interesting, I've been tuning this parameter for
many years and never really cared about the "90% cache rate". The proof is
in the pudding: rpc is the canonical "try it and see" parameter. Tweak.
Test. Repeat.
Cheers,
Greg
Attachment: 0001-Lower-random_page_cost-default-to-1.2-and-update-docs-about-it.patch
(application/octet-stream)
From 50f4800e3501a84c0d41a5e29cf384bf465f230d Mon Sep 17 00:00:00 2001
From: Greg Sabino Mullane <greg@turnstep.com>
Date: Mon, 30 Sep 2024 09:52:28 -0400
Subject: [PATCH] Lower random_page_cost default to 1.2 and update docs about
it.
---
doc/src/sgml/config.sgml | 35 ++++++-------------
src/backend/utils/misc/postgresql.conf.sample | 2 +-
2 files changed, 12 insertions(+), 25 deletions(-)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0aec11f443..e901a71613 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5728,14 +5728,15 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
<listitem>
<para>
Sets the planner's estimate of the cost of a
- non-sequentially-fetched disk page. The default is 4.0.
- This value can be overridden for tables and indexes in a particular
- tablespace by setting the tablespace parameter of the same name
- (see <xref linkend="sql-altertablespace"/>).
- </para>
+ non-sequentially-fetched disk page. The default is <literal>1.2</literal>.
+ Note that this value assumes use of a solid-state drive, which
+ has a low random read cost relative to sequential scans. If you
+ are using a hard disk drive, raising the value to <literal>3</literal>
+ or <literal>4</literal> may give you better performance.
+ </para>
<para>
- Reducing this value relative to <varname>seq_page_cost</varname>
+ Reducing <varname>random_page_cost</varname> relative to <varname>seq_page_cost</varname>
will cause the system to prefer index scans; raising it will
make index scans look relatively more expensive. You can raise
or lower both values together to change the importance of disk I/O
@@ -5744,24 +5745,10 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</para>
<para>
- Random access to mechanical disk storage is normally much more expensive
- than four times sequential access. However, a lower default is used
- (4.0) because the majority of random accesses to disk, such as indexed
- reads, are assumed to be in cache. The default value can be thought of
- as modeling random access as 40 times slower than sequential, while
- expecting 90% of random reads to be cached.
- </para>
-
- <para>
- If you believe a 90% cache rate is an incorrect assumption
- for your workload, you can increase random_page_cost to better
- reflect the true cost of random storage reads. Correspondingly,
- if your data is likely to be completely in cache, such as when
- the database is smaller than the total server memory, decreasing
- random_page_cost can be appropriate. Storage that has a low random
- read cost relative to sequential, e.g., solid-state drives, might
- also be better modeled with a lower value for random_page_cost,
- e.g., <literal>1.1</literal>.
+ This value can be overridden for tables and indexes in a particular
+ tablespace by setting the tablespace parameter of the same name
+ (see <xref linkend="sql-altertablespace"/>). It can also be set
+ per-database and per-role, as needed.
</para>
<tip>
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 667e0dc40a..424df4a868 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -417,7 +417,7 @@
# - Planner Cost Constants -
#seq_page_cost = 1.0 # measured on an arbitrary scale
-#random_page_cost = 4.0 # same scale as above
+#random_page_cost = 1.2 # same scale as above
#cpu_tuple_cost = 0.01 # same scale as above
#cpu_index_tuple_cost = 0.005 # same scale as above
#cpu_operator_cost = 0.0025 # same scale as above
--
2.30.2
On Mon, Sep 30, 2024 at 10:05:29AM -0400, Greg Sabino Mullane wrote:
On Fri, Sep 27, 2024 at 12:03 PM Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
wrote:
It might also be worth mentioning cloudy block storage (e.g. AWS' EBS),
which is typically backed by SSDs, but has extra network latency.
That seems a little too in the weeds for me, but wording suggestions are
welcome. To get things moving forward, I made a doc patch which changes a few
things, namely:
* Mentions the distinction between ssd and hdd right up front.
* Moves the tablespace talk to the very end, as tablespace use is getting rarer
(again, thanks in part to ssds)
* Mentions the capability to set per-database and per-role since we mention
per-tablespace.
* Removes a lot of the talk of caches and justifications for the 4.0 setting.
While those are interesting, I've been tuning this parameter for many years and
never really cared about the "90% cache rate". The proof is in the pudding: rpc
is the canonical "try it and see" parameter. Tweak. Test. Repeat.
I am not a fan of this patch. I don't see why _removing_ the magnetic
part helps because you then have no logic for why 1.2 was chosen. I
would put the magnetic in a separate paragraph perhaps, and recommend
4.0 for it. Also, per-tablespace makes sense because of physical media
differences, but what purpose would per-database and per-role serve?
Also, per-tablespace is not a connection-activated item like the other
two.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
On Tue, 15 Oct 2024 at 10:15, Bruce Momjian <bruce@momjian.us> wrote:
I am not a fan of this patch. I don't see why _removing_ the magnetic
part helps because you then have no logic for why 1.2 was chosen. I
would put the magnetic in a separate paragraph perhaps, and recommend
4.0 for it. Also, per-tablespace makes sense because of physical media
differences, but what purpose would per-database and per-role serve?
Also, per-tablespace is not a connection-activated item like the other
two.
Yeah, I think any effort to change the default value for this setting
would require some analysis to prove that the newly proposed default
is a more suitable setting than the current default. I mean, why 1.2
and not 1.1 or 1.3? Where's the evidence that 1.2 is the best value
for this?
I don't think just providing evidence that random read times are
closer to sequential read times on SSDs than they are on HDDs is
going to be enough. What we want to know is if the planner
costs become more closely related to the execution time or not. From
my experience, random_page_cost really only has a loose grasp on
reality, so you might find that it's hard to prove this with any
degree of confidence (just have a look at how inconsiderate
index_pages_fetched() is to other queries running on the database, for
example).
I suggest first identifying all the locations that use
random_page_cost then coming up with some test cases that run queries
that exercise those locations in a way that does things like vary the
actual selectivity of some value to have the planner switch plans then
try varying the random_page_cost to show that the switchover point is
more correct with the new value than it is with the old value. It
would be nice to have this as a script so that other people could
easily run it on their hardware to ensure that the random_page_cost we
choose as the new default is representative of the average hardware.
You'll likely need to do this with varying index sizes. I imagine to
properly test this so that we'd have any useful degree of confidence
that the new value is better than the old one would likely require a
benchmark that runs for several hours. At the upper end, you'd likely
want the data sizes to exceed the size of RAM. Another dimension that
the tests should likely explore is varying data locality.
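To illustrate what I mean by the switchover point, here's a deliberately crude model of the seq scan vs index scan decision. The formulas are drastically simplified stand-ins for what costsize.c really does, and every constant is just the stock default, so treat it as a sketch of the test idea rather than the real planner:

```python
# Toy planner cost model: find the selectivity at which an index scan
# starts losing to a sequential scan, for a given random_page_cost.
# These formulas are simplified stand-ins for costsize.c, not the real
# PostgreSQL cost model.

SEQ_PAGE_COST = 1.0
CPU_TUPLE_COST = 0.01
CPU_INDEX_TUPLE_COST = 0.005

def seqscan_cost(pages, tuples):
    # Read every page sequentially, process every tuple.
    return pages * SEQ_PAGE_COST + tuples * CPU_TUPLE_COST

def indexscan_cost(pages, tuples, selectivity, random_page_cost):
    # Pessimistically assume one random heap fetch per matching tuple.
    matched = tuples * selectivity
    return (matched * random_page_cost
            + matched * (CPU_INDEX_TUPLE_COST + CPU_TUPLE_COST))

def crossover_selectivity(pages, tuples, random_page_cost, step=0.001):
    """Lowest selectivity at which the seq scan is estimated cheaper."""
    sel = step
    while sel < 1.0:
        if (indexscan_cost(pages, tuples, sel, random_page_cost)
                >= seqscan_cost(pages, tuples)):
            return sel
        sel += step
    return 1.0
```

Running crossover_selectivity with random_page_cost of 4.0 versus 1.2 shows the index scan surviving to a higher selectivity under the lower setting; the real benchmark's job is to find which switchover point best matches actual execution times.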
David
David Rowley <dgrowleyml@gmail.com> writes:
Yeah, I think any effort to change the default value for this setting
would require some analysis to prove that the newly proposed default
is a more suitable setting than the current default. I mean, why 1.2
and not 1.1 or 1.3? Where's the evidence that 1.2 is the best value
for this?
Yeah, that's been my main concern about this proposal too.
I recall that when we settled on 4.0 as a good number for
spinning-rust drives, it came out of some experimentation that
I'd done that involved multiple-day-long tests. I don't recall any
more details than that sadly, but perhaps trawling the mailing list
archives would yield useful info. It looks like the 4.0 value came
in with b1577a7c7 of 2000-02-15, so late 1999/early 2000 would be the
time frame to look in.
regards, tom lane
I wrote:
I recall that when we settled on 4.0 as a good number for
spinning-rust drives, it came out of some experimentation that
I'd done that involved multiple-day-long tests. I don't recall any
more details than that sadly, but perhaps trawling the mailing list
archives would yield useful info. It looks like the 4.0 value came
in with b1577a7c7 of 2000-02-15, so late 1999/early 2000 would be the
time frame to look in.
I tried asking https://www.postgresql.org/search/ about
random_page_cost, and got nothing except search engine timeouts :-(.
However, some digging in my own local archives yielded
/messages/by-id/25387.948414692@sss.pgh.pa.us
/messages/by-id/14601.949786166@sss.pgh.pa.us
That confirms my recollection of multiple-day test runs, but doesn't
offer much more useful detail than that :-(. What I think I did
though was to create some large tables (much bigger than the RAM on
the machine I had) and actually measure the runtime of seq scans
versus full-table index scans, repeating plenty 'o times to try to
average out the noise. There was some talk in those threads of
reducing that to a publishable script, but it was never followed up
on.
regards, tom lane
On Mon, Oct 14, 2024 at 5:15 PM Bruce Momjian <bruce@momjian.us> wrote:
I am not a fan of this patch. I don't see why _removing_ the magnetic
part helps because you then have no logic for why 1.2 was chosen.
Okay, but we have no documented logic on why 4.0 was chosen either. :)
I would put the magnetic in a separate paragraph perhaps, and recommend
4.0 for it.
Sounds doable. Even in the pre-SSD age I recall lowering this as a fairly
standard practice, but I'm fine with a recommendation of 4. Partly because
I doubt anyone will use it much.
Also, per-tablespace makes sense because of physical media
differences, but what purpose would per-database and per-role serve?
Also, per-tablespace is not a connection-activated item like the other
two.
Good point, I withdraw that part.
Cheers,
Greg
On Mon, Oct 14, 2024 at 10:20 PM David Rowley <dgrowleyml@gmail.com> wrote:
Yeah, I think any effort to change the default value for this setting
would require some analysis to prove that the newly proposed default
is a more suitable setting than the current default. I mean, why 1.2 and
not 1.1 or 1.3? Where's the evidence that 1.2 is the best value
for this?
As I said, I was just throwing that 1.2 number out there. It felt right,
although perhaps a tad high (which seems right as we keep things very
conservative). I agree we should make a best effort to have an accurate,
defensible default. We all know (I hope) that 4.0 is wrong for SSDs.
I don't think just providing evidence that random read times are closer to
sequential read times on SSDs than they are on HDDs is going to be enough.
...
It would be nice to have this as a script so that other people could
easily run it on their hardware to ensure that random_page_cost we
choose as the new default is representative of the average hardware.
Heh, this is starting to feel like belling the cat (see
https://fablesofaesop.com/belling-the-cat.html)
Remember this is still just a default, and we should encourage people to
tweak it themselves based on their own workloads. I just want people to
start in the right neighborhood. I'll see about working on some more
research / generating a script, but help from others is more than welcome.
Cheers,
Greg
On Fri, 25 Oct 2024 at 13:14, Greg Sabino Mullane <htamfids@gmail.com> wrote:
On Mon, Oct 14, 2024 at 10:20 PM David Rowley <dgrowleyml@gmail.com> wrote:
Yeah, I think any effort to change the default value for this setting
would require some analysis to prove that the newly proposed default
is a more suitable setting than the current default. I mean, why 1.2 and
not 1.1 or 1.3? Where's the evidence that 1.2 is the best value
for this?
As I said, I was just throwing that 1.2 number out there. It felt right,
although perhaps a tad high (which seems right as we keep things very
conservative). I agree we should make a best effort to have an accurate,
defensible default. We all know (I hope) that 4.0 is wrong for SSDs.
I don't think we're going to find the correct new value for this
setting by throwing randomly chosen numbers at each other on an email
thread. Unfortunately, someone is going to have to do some work to
figure out what the number should be, and then hopefully someone else
can verify that work to check that person is correct.
I'm not trying to be smart or funny here, but I am failing to
comprehend why you think that offering a number, without any information
about how you selected it, would be acceptable as the new default
random_page_cost. Are you expecting someone else
to go and do the work to prove that your selected number is the
correct one? It's been 4 weeks since your first email and nobody has
done that yet, so maybe you might need to consider other ways to
achieve your goal.
I don't think just providing evidence that random read times are closer to
sequential read times on SSDs than they are on HDDs is going to be enough.
...
It would be nice to have this as a script so that other people could
easily run it on their hardware to ensure that the random_page_cost we
choose as the new default is representative of the average hardware.
Heh, this is starting to feel like belling the cat (see
https://fablesofaesop.com/belling-the-cat.html)
I don't see the similarity. Changing the default random_page_cost
requires analysis to find what the new default should be. The
execution of the actual change in default is dead simple. With
belling the cat, it seems like the execution is the hard part and
nobody is debating the idea itself.
Remember this is still just a default, and we should encourage people to
tweak it themselves based on their own workloads. I just want people to
start in the right neighborhood. I'll see about working on some more
research / generating a script, but help from others is more than welcome.
You might be mistakenly thinking that the best random_page_cost is an
exact ratio of how much slower a random seek and read is than a
sequential read. There are unfortunately many other factors to
consider. The correct setting is going to be the one where the chosen
plan uses the scan method that's the fastest and knowing the answer to
that is going to take some benchmarks on PostgreSQL. Our cost model
simply isn't perfect enough for you to assume that I/O is the
only factor that changes between an Index Scan and a Seq Scan.
I'd say it's not overly difficult to come up with test cases that go
to prove the value you select is "correct". I've done this before for
CPU-related costs. I think with I/O the main difference will be that
your tests should be much larger, and doing that will mean getting the
results takes much more time. Here's a link to some analysis I did to
help solve a problem relating to partition-wise aggregates [1]. Maybe
you can use a similar method to determine random_page_cost.
David
[1]: /messages/by-id/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B=ZRh-rxy9qxfPA5Gw@mail.gmail.com
Hi Greg Sabino Mullane,
Another thing: simply changing the configuration template is not
effective on its own; you also need to modify the DEFAULT_RANDOM_PAGE_COST
value:
{
	{"random_page_cost", PGC_USERSET, QUERY_TUNING_COST,
		gettext_noop("Sets the planner's estimate of the cost of a "
					 "nonsequentially fetched disk page."),
		NULL,
		GUC_EXPLAIN
	},
	&random_page_cost,
	DEFAULT_RANDOM_PAGE_COST, 0, DBL_MAX,
	NULL, NULL, NULL
},
src/include/optimizer/cost.h
/* defaults for costsize.c's Cost parameters */
/* NB: cost-estimation code should use the variables, not these constants! */
/* If you change these, update backend/utils/misc/postgresql.conf.sample */
#define DEFAULT_SEQ_PAGE_COST 1.0
#define DEFAULT_RANDOM_PAGE_COST 4.0
#define DEFAULT_CPU_TUPLE_COST 0.01
#define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
#define DEFAULT_CPU_OPERATOR_COST 0.0025
#define DEFAULT_PARALLEL_TUPLE_COST 0.1
#define DEFAULT_PARALLEL_SETUP_COST 1000.0
Thanks
On Thu, Oct 24, 2024 at 08:01:11PM -0400, Greg Sabino Mullane wrote:
On Mon, Oct 14, 2024 at 5:15 PM Bruce Momjian <bruce@momjian.us> wrote:
I am not a fan of this patch. I don't see why _removing_ the magnetic
part helps because you then have no logic for why 1.2 was chosen.
Okay, but we have no documented logic on why 4.0 was chosen either. :)
Uh, we do, and it is in the docs:
Random access to mechanical disk storage is normally much more expensive
than four times sequential access. However, a lower default is used
(4.0) because the majority of random accesses to disk, such as indexed
reads, are assumed to be in cache. The default value can be thought of
as modeling random access as 40 times slower than sequential, while
expecting 90% of random reads to be cached.
You surely saw this when you created the patch and removed the text.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
Bruce Momjian <bruce@momjian.us> writes:
On Thu, Oct 24, 2024 at 08:01:11PM -0400, Greg Sabino Mullane wrote:
Okay, but we have no documented logic on why 4.0 was chosen either. :)
Uh, we do, and it is in the docs:
Random access to mechanical disk storage is normally much more expensive
than four times sequential access. However, a lower default is used
(4.0) because the majority of random accesses to disk, such as indexed
reads, are assumed to be in cache. The default value can be thought of
as modeling random access as 40 times slower than sequential, while
expecting 90% of random reads to be cached.
Meh. Reality is that that is somebody's long-after-the-fact apologia
for a number that was obtained by experimentation.
regards, tom lane
On Thu, Oct 31, 2024 at 1:43 PM Bruce Momjian <bruce@momjian.us> wrote:
I am not a fan of this patch. I don't see why _removing_ the magnetic
part helps because you then have no logic for why 1.2 was chosen.
Okay, but we have no documented logic on why 4.0 was chosen either. :)
Uh, we do, and it is in the docs:
Random access to mechanical disk storage is normally much more expensive
than four times sequential access. However, a lower default is used
(4.0) because the majority of random accesses to disk, such as indexed
reads, are assumed to be in cache. The default value can be thought of
as modeling random access as 40 times slower than sequential, while
expecting 90% of random reads to be cached.
You surely saw this when you created the patch and removed the text.
Yes, I did, but there is no documentation to support those numbers. Is it
really 40 times slower? Which hard disks, in what year? Where did the 90%
come from, and is that really an accurate average for production systems
with multiple caching layers? I know Tom ran actual tests many years ago,
but at the end of the day, all of these numbers will vary wildly depending
on a host of factors, so we have to make some educated guesses.
I guess my point was that my 1.2 (based on many years of tuning many client
systems) seems no less imprecise than 4.0 (based on actual tests many years
ago, with hardware that has changed a lot), for a default value that people
should tune for their specific system anyway.
Maybe these new tests people are asking for in this thread to determine a
better default rpc for SSDs can also help us determine a better rpc for HDDs
as well.
Cheers,
Greg