Changing the default random_page_cost value

Started by Greg Sabino Mullaneover 1 year ago16 messages

greg@turnstep.com

over 1 year ago

tl;dr let's assume SSDs are popular and HDDs are the exception and flip our
default

As I write this email, it's the year 2024. I think it is time we lower our
"default" setting of random_page_cost (as set in postgresql.conf.sample and
the docs). Even a decade ago, the current default of 4 was considered
fairly conservative and often lowered. The git logs shows that this value
was last touched in 2006, during the age of spinning metal. We are now in a
new era, the age of SSDs, and thus we should lower this default value to
reflect the fact that the vast majority of people using Postgres these days
are doing so on solid state drives. We tend to stay ultra-conservative in
all of our settings, but we also need to recognize when there has been a
major shift in the underlying hardware - and calculations that our defaults
are based on.

Granted, there are other factors involved, and yes, perhaps we should tweak
some of the similar settings as well, but ranom_page_cost is the one
setting most out of sync with today's hardware realities. So I'll be brave
and throw a number out there: 1.2. And change our docs to say wordage like
"if you are using an older hard disk drive technology, you may want to try
raising rpc" to replace our fairly-hidden note about SSDs buried in the
last sentence - of the fourth paragraph - of the rpc docs.

Real data about performance on today's SSDs are welcome, and/or some way to
generate a more accurate default.

Cheers,
Greg

Roberto Mello

roberto.mello@gmail.com

over 1 year ago

In reply to: Greg Sabino Mullane (#1)

Re: Changing the default random_page_cost value

On Fri, Sep 27, 2024 at 8:07 AM Greg Sabino Mullane <htamfids@gmail.com>
wrote:

tl;dr let's assume SSDs are popular and HDDs are the exception and flip
our default

<snip>

Granted, there are other factors involved, and yes, perhaps we should
tweak some of the similar settings as well, but ranom_page_cost is the one
setting most out of sync with today's hardware realities. So I'll be brave
and throw a number out there: 1.2. And change our docs to say wordage like
"if you are using an older hard disk drive technology, you may want to try
raising rpc" to replace our fairly-hidden note about SSDs buried in the
last sentence - of the fourth paragraph - of the rpc docs.

I suggest a slightly nicer comment in the default conf file, like "For
spinning hard drives, raise this to at least 3 and test"

Roberto

Laurenz Albe

laurenz.albe@cybertec.at

over 1 year ago

In reply to: Greg Sabino Mullane (#1)

Re: Changing the default random_page_cost value

On Fri, 2024-09-27 at 10:07 -0400, Greg Sabino Mullane wrote:

So I'll be brave and throw a number out there: 1.2.

Laurenz Albe

Dagfinn Ilmari Mannsåker

ilmari@ilmari.org

over 1 year ago

In reply to: Greg Sabino Mullane (#1)

Re: Changing the default random_page_cost value

Greg Sabino Mullane <htamfids@gmail.com> writes:

So I'll be brave and throw a number out there: 1.2. And change our
docs to say wordage like "if you are using an older hard disk drive
technology, you may want to try raising rpc" to replace our
fairly-hidden note about SSDs buried in the last sentence - of the
fourth paragraph - of the rpc docs.

It might also be worth mentioning cloudy block storage (e.g. AWS' EBS),
which is typically backed by SSDs, but has extra network latency.

- ilmari

Greg Sabino Mullane

greg@turnstep.com

over 1 year ago

In reply to: Dagfinn Ilmari Mannsåker (#4)

Re: Changing the default random_page_cost value

On Fri, Sep 27, 2024 at 12:03 PM Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
wrote:

It might also be worth mentioning cloudy block storage (e.g. AWS' EBS),
which is typically backed by SSDs, but has extra network latency.

That seems a little too in the weeds for me, but wording suggestions are
welcome. To get things moving forward, I made a doc patch which changes a
few things, namely:

* Mentions the distinction between ssd and hdd right up front.
* Moves the tablespace talk to the very end, as tablespace use is getting
rarer (again, thanks in part to ssds)
* Mentions the capability to set per-database and per-role since we mention
per-tablespace.
* Removes a lot of the talk of caches and justifications for the 4.0
setting. While those are interesting, I've been tuning this parameter for
many years and never really cared about the "90% cache rate". The proof is
in the pudding: rpc is the canonical "try it and see" parameter. Tweak.
Test. Repeat.

Cheers,
Greg

Bruce Momjian

bruce@momjian.us

over 1 year ago

In reply to: Greg Sabino Mullane (#5)

Re: Changing the default random_page_cost value

On Mon, Sep 30, 2024 at 10:05:29AM -0400, Greg Sabino Mullane wrote:

On Fri, Sep 27, 2024 at 12:03 PM Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
wrote:

It might also be worth mentioning cloudy block storage (e.g. AWS' EBS),
which is typically backed by SSDs, but has extra network latency.

That seems a little too in the weeds for me, but wording suggestions are
welcome. To get things moving forward, I made a doc patch which changes a few
things, namely:

* Mentions the distinction between ssd and hdd right up front.
* Moves the tablespace talk to the very end, as tablespace use is getting rarer
(again, thanks in part to ssds)
* Mentions the capability to set per-database and per-role since we mention
per-tablespace.
* Removes a lot of the talk of caches and justifications for the 4.0 setting.
While those are interesting, I've been tuning this parameter for many years and
never really cared about the "90% cache rate". The proof is in the pudding: rpc
is the canonical "try it and see" parameter. Tweak. Test. Repeat.

I am not a fan of this patch. I don't see why _removing_ the magnetic
part helps because you then have no logic for any 1.2 was chosen. I
would put the magnetic in a separate paragraph perhaps, and recommend
4.0 for it. Also, per-tablespace makes sense because of physical media
differences, but what purpose would per-database and per-role serve?
Also, per-tablespace is not a connection-activated item like the other
two.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"

David Rowley

dgrowleyml@gmail.com

over 1 year ago

In reply to: Bruce Momjian (#6)

Re: Changing the default random_page_cost value

On Tue, 15 Oct 2024 at 10:15, Bruce Momjian <bruce@momjian.us> wrote:

I am not a fan of this patch. I don't see why _removing_ the magnetic
part helps because you then have no logic for any 1.2 was chosen. I
would put the magnetic in a separate paragraph perhaps, and recommend
4.0 for it. Also, per-tablespace makes sense because of physical media
differences, but what purpose would per-database and per-role serve?
Also, per-tablespace is not a connection-activated item like the other
two.

Yeah, I think any effort to change the default value for this setting
would require some analysis to prove that the newly proposed default
is a more suitable setting than the current default. I mean, why 1.2
and not 1.1 or 1.3? Where's the evidence that 1.2 is the best value
for this?

I don't think just providing evidence that random read times are
closer to sequential read times on SSDs are closer than they are with
HDDs is going to be enough. What we want to know is if the planner
costs become more closely related to the execution time or not. From
my experience, random_page_cost really only has a loose grasp on
reality, so you might find that it's hard to prove this with any
degree of confidence (just have a look at how inconsiderate
index_pages_fetched() is to other queries running on the database, for
example).

I suggest first identifying all the locations that use
random_page_cost then coming up with some test cases that run queries
that exercise those locations in a way that does things like vary the
actual selectivity of some value to have the planner switch plans then
try varying the random_page_cost to show that the switchover point is
more correct with the new value than it is with the old value. It
would be nice to have this as a script so that other people could
easily run it on their hardware to ensure that random_page_cost we
choose as the new default is representative of the average hardware.
You'll likely need to do this with varying index sizes. I imagine to
properly test this so that we'd have any useful degree of confidence
that the new value is better than the old one would likely require a
benchmark that runs for several hours. At the upper end, you'd likely
want the data sizes to exceed the size of RAM. Another dimension that
the tests should likely explore is varying data locality.

David

Tom Lane

tgl@sss.pgh.pa.us

over 1 year ago

In reply to: David Rowley (#7)

Re: Changing the default random_page_cost value

David Rowley <dgrowleyml@gmail.com> writes:

Yeah, I think any effort to change the default value for this setting
would require some analysis to prove that the newly proposed default
is a more suitable setting than the current default. I mean, why 1.2
and not 1.1 or 1.3? Where's the evidence that 1.2 is the best value
for this?

Yeah, that's been my main concern about this proposal too.

I recall that when we settled on 4.0 as a good number for
spinning-rust drives, it came out of some experimentation that
I'd done that involved multiple-day-long tests. I don't recall any
more details than that sadly, but perhaps trawling the mailing list
archives would yield useful info. It looks like the 4.0 value came
in with b1577a7c7 of 2000-02-15, so late 1999/early 2000 would be the
time frame to look in.

regards, tom lane

Tom Lane

tgl@sss.pgh.pa.us

over 1 year ago

In reply to: Tom Lane (#8)

Re: Changing the default random_page_cost value

I wrote:

I recall that when we settled on 4.0 as a good number for
spinning-rust drives, it came out of some experimentation that
I'd done that involved multiple-day-long tests. I don't recall any
more details than that sadly, but perhaps trawling the mailing list
archives would yield useful info. It looks like the 4.0 value came
in with b1577a7c7 of 2000-02-15, so late 1999/early 2000 would be the
time frame to look in.

I tried asking https://www.postgresql.org/search/ about
random_page_cost, and got nothing except search engine timeouts :-(.
However, some digging in my own local archives yielded

/messages/by-id/25387.948414692@sss.pgh.pa.us

/messages/by-id/14601.949786166@sss.pgh.pa.us

That confirms my recollection of multiple-day test runs, but doesn't
offer much more useful detail than that :-(. What I think I did
though was to create some large tables (much bigger than the RAM on
the machine I had) and actually measure the runtime of seq scans
versus full-table index scans, repeating plenty 'o times to try to
average out the noise. There was some talk in those threads of
reducing that to a publishable script, but it was never followed up
on.

regards, tom lane

#10

Greg Sabino Mullane

greg@turnstep.com

over 1 year ago

In reply to: Bruce Momjian (#6)

Re: Changing the default random_page_cost value

On Mon, Oct 14, 2024 at 5:15 PM Bruce Momjian <bruce@momjian.us> wrote:

I am not a fan of this patch. I don't see why _removing_ the magnetic
part helps because you then have no logic for any 1.2 was chosen.

Okay, but we have no documented logic on why 4.0 was chosen either. :)

I would put the magnetic in a separate paragraph perhaps, and recommend

4.0 for it.

Sounds doable. Even in the pre-SSD age I recall lowering this as a fairly
standard practice, but I'm fine with a recommendation of 4. Partly because
I doubt anyone will use it much.

Also, per-tablespace makes sense because of physical media

differences, but what purpose would per-database and per-role serve?
Also, per-tablespace is not a connection-activated item like the other
two.

Good point, I withdraw that part.

Cheers,
Greg

#11

Greg Sabino Mullane

greg@turnstep.com

over 1 year ago

In reply to: David Rowley (#7)

Re: Changing the default random_page_cost value

On Mon, Oct 14, 2024 at 10:20 PM David Rowley <dgrowleyml@gmail.com> wrote:

Yeah, I think any effort to change the default value for this setting
would require some analysis to prove that the newly proposed default
is a more suitable setting than the current default. I mean, why 1.2 and
not 1.1 or 1.3? Where's the evidence that 1.2 is the best value
for this?

As I said, I was just throwing that 1.2 number out there. It felt right,
although perhaps a tad high (which seems right as we keep things very
conservative). I agree we should make a best effort to have an accurate,
defendable default. We all know (I hope) that 4.0 is wrong for SSDs.

I don't think just providing evidence that random read times are closer to
sequential read times on SSDs are closer than they are with
HDDs is going to be enough.

...

It would be nice to have this as a script so that other people could
easily run it on their hardware to ensure that random_page_cost we
choose as the new default is representative of the average hardware.

Heh, this is starting to feel like belling the cat (see
https://fablesofaesop.com/belling-the-cat.html)

Remember this is still just a default, and we should encourage people to
tweak it themselves based on their own workloads. I just want people to
start in the right neighborhood. I'll see about working on some more
research / generating a script, but help from others is more than welcome.

Cheers,
Greg

#12

David Rowley

dgrowleyml@gmail.com

over 1 year ago

In reply to: Greg Sabino Mullane (#11)

Re: Changing the default random_page_cost value

On Fri, 25 Oct 2024 at 13:14, Greg Sabino Mullane <htamfids@gmail.com> wrote:

On Mon, Oct 14, 2024 at 10:20 PM David Rowley <dgrowleyml@gmail.com> wrote:

Yeah, I think any effort to change the default value for this setting would require some analysis to prove that the newly proposed default
is a more suitable setting than the current default. I mean, why 1.2 and not 1.1 or 1.3? Where's the evidence that 1.2 is the best value
for this?

As I said, I was just throwing that 1.2 number out there. It felt right, although perhaps a tad high (which seems right as we keep things very conservative). I agree we should make a best effort to have an accurate, defendable default. We all know (I hope) that 4.0 is wrong for SSDs.

I don't think we're going to find the correct new value for this
setting by throwing randomly chosen numbers at each other on an email
thread. Unfortunately, someone is going to have to do some work to
figure out what the number should be, and then hopefully someone else
can verify that work to check that person is correct.

I'm not trying to be smart or funny here, but I just am failing to
comprehend why you think you offering a number without any information
about how you selected that number to set as the new default
random_page_cost would be acceptable. Are you expecting someone else
to go and do the work to prove that your selected number is the
correct one? It's been 4 weeks since your first email and nobody has
done that yet, so maybe you might need to consider other ways to
achieve your goal.

I don't think just providing evidence that random read times are closer to sequential read times on SSDs are closer than they are with
HDDs is going to be enough.

...

It would be nice to have this as a script so that other people could easily run it on their hardware to ensure that random_page_cost we
choose as the new default is representative of the average hardware.

Heh, this is starting to feel like belling the cat (see https://fablesofaesop.com/belling-the-cat.html)

I don't see the similarity. Changing the default random_page_cost
requires analysis to find what the new default should be. The
execution of the actual change in default is dead simple. With
belling the cat, it seems like the execution is the hard part and
nobody is debating the idea itself.

Remember this is still just a default, and we should encourage people to tweak it themselves based on their own workloads. I just want people to start in the right neighborhood. I'll see about working on some more research / generating a script, but help from others is more than welcome.

You might be mistakenly thinking that the best random_page_cost is an
exact ratio of how much slower a random seek and read is from a
sequential read. There are unfortunately many other factors to
consider. The correct setting is going to be the one where the chosen
plan uses the scan method that's the fastest and knowing the answer to
that is going to take some benchmarks on PostgreSQL. Our cost model
simply just isn't perfect enough for you to assume that I/O is the
only factor that changes between an Index Scan and a Seq Scan.

I'd say it's not overly difficult to come up with test cases that go
to prove the value you select is "correct". I've done this before for
CPU-related costs. I think with I/O the main difference will be that
your tests should be much larger, and doing that will mean getting the
results takes much more time. Here's a link to some analysis I did to
help solve a problem relating to partition-wise aggregates [1]/messages/by-id/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B=ZRh-rxy9qxfPA5Gw@mail.gmail.com. Maybe
you can use a similar method to determine random_page_cost.

David

[1]: /messages/by-id/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B=ZRh-rxy9qxfPA5Gw@mail.gmail.com

#13

wenhui qiu

qiuwenhuifx@gmail.com

over 1 year ago

In reply to: David Rowley (#12)

Re: Changing the default random_page_cost value

HI Greg Sabino Mullane
Another thing is that you simply change the configuration template is
not effective,

need to modify the DEFAULT_RANDOM_PAGE_COST values
{
{"random_page_cost", PGC_USERSET, QUERY_TUNING_COST,
gettext_noop("Sets the planner's estimate of the cost of a "
"nonsequentially fetched disk page."),
NULL,
GUC_EXPLAIN
},
&random_page_cost,
DEFAULT_RANDOM_PAGE_COST, 0, DBL_MAX,
NULL, NULL, NULL
},

src/include/optimizer/cost.h
/* defaults for costsize.c's Cost parameters */
/* NB: cost-estimation code should use the variables, not these constants!
*/
/* If you change these, update backend/utils/misc/postgresql.conf.sample */
#define DEFAULT_SEQ_PAGE_COST 1.0
#define DEFAULT_RANDOM_PAGE_COST 4.0
#define DEFAULT_CPU_TUPLE_COST 0.01
#define DEFAULT_CPU_INDEX_TUPLE_COST 0.005
#define DEFAULT_CPU_OPERATOR_COST 0.0025
#define DEFAULT_PARALLEL_TUPLE_COST 0.1
#define DEFAULT_PARALLEL_SETUP_COST 1000.0

Thanks

Show quoted text

#14

Bruce Momjian

bruce@momjian.us

over 1 year ago

In reply to: Greg Sabino Mullane (#10)

Re: Changing the default random_page_cost value

On Thu, Oct 24, 2024 at 08:01:11PM -0400, Greg Sabino Mullane wrote:

On Mon, Oct 14, 2024 at 5:15 PM Bruce Momjian <bruce@momjian.us> wrote:

I am not a fan of this patch. I don't see why _removing_ the magnetic
part helps because you then have no logic for any 1.2 was chosen.

Okay, but we have no documented logic on why 4.0 was chosen either. :)

Uh, we do, and it is in the docs:

Random access to mechanical disk storage is normally much more expensive
than four times sequential access. However, a lower default is used
(4.0) because the majority of random accesses to disk, such as indexed
reads, are assumed to be in cache. The default value can be thought of
as modeling random access as 40 times slower than sequential, while
expecting 90% of random reads to be cached.

You surely saw this when you created the patch and removed the text.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"

#15

Tom Lane

tgl@sss.pgh.pa.us

over 1 year ago

In reply to: Bruce Momjian (#14)

Re: Changing the default random_page_cost value

Bruce Momjian <bruce@momjian.us> writes:

On Thu, Oct 24, 2024 at 08:01:11PM -0400, Greg Sabino Mullane wrote:

Okay, but we have no documented logic on why 4.0 was chosen either. :)

Uh, we do, and it is in the docs:

Random access to mechanical disk storage is normally much more expensive
than four times sequential access. However, a lower default is used
(4.0) because the majority of random accesses to disk, such as indexed
reads, are assumed to be in cache. The default value can be thought of
as modeling random access as 40 times slower than sequential, while
expecting 90% of random reads to be cached.

Meh. Reality is that that is somebody's long-after-the-fact apologia
for a number that was obtained by experimentation.

regards, tom lane

#16

Greg Sabino Mullane

greg@turnstep.com

over 1 year ago

In reply to: Bruce Momjian (#14)

Re: Changing the default random_page_cost value

On Thu, Oct 31, 2024 at 1:43 PM Bruce Momjian <bruce@momjian.us> wrote:

I am not a fan of this patch. I don't see why _removing_ the

magnetic

part helps because you then have no logic for any 1.2 was chosen.

Okay, but we have no documented logic on why 4.0 was chosen either. :)

Uh, we do, and it is in the docs:

Random access to mechanical disk storage is normally much more
expensive
than four times sequential access. However, a lower default is
used
(4.0) because the majority of random accesses to disk, such as
indexed
reads, are assumed to be in cache. The default value can be
thought of
as modeling random access as 40 times slower than sequential, while
expecting 90% of random reads to be cached.

You surely saw this when you created the patch and removed the text.

Yes, I did, but there is no documentation to support those numbers. Is it
really 40 times slower? Which hard disks, in what year? Where did the 90%
come from, and is that really an accurate average for production systems
with multiple caching layers? I know Tom ran actual tests many years ago,
but at the end of the day, all of these numbers will vary wildly depending
on a host of factors, so we have to make some educated guesses.

I guess my point was that my 1.2 (based on many years of tuning many client
systems) seems no less imprecise than 4.0 (based on actual tests many years
ago, with hardware that has changed a lot), for a default value that people
should tune for their specific system anyway.

Maybe these new tests people are asking for in this thread to determine a
better default rpc for SSDs can also help us determine a better rpc for HDs
as well.

Cheers,
Greg

Changing the default random_page_cost value

Attachments: