Should we update the random_page_cost default value?

Started by Tomas Vondra · 5 months ago · 68 messages
#1 Tomas Vondra
tomas.vondra@2ndquadrant.com

Hi,

I wonder if it's time to consider updating the random_page_cost default
value. There are multiple reasons to maybe do that.

The GUC (and the 4.0 default) was introduced in ~2000 [1], so ~25 years
ago. During that time the world went from rotational drives through
multiple generations of flash / network-attached storage. I find it hard
to believe those changes wouldn't affect random_page_cost.

And indeed, it's common advice to reduce the GUC closer to 1.0 on
SSDs. I myself recommended doing that in the past, but over time I've
grown somewhat skeptical about it. The advice is based on the "obvious"
wisdom that SSDs are much better at handling random I/O than rotational
disks.

But this has two flaws. First, it assumes the current 4.0 default makes
sense - maybe it doesn't, and then it's useless as a "starting point".
Second, it's not obvious how much better SSDs are without concurrent
I/Os (which are needed to fully leverage an SSD). We don't issue
concurrent I/Os for index scans (yet), and even if we did, the cost
model has no concept of such concurrency.

Recently, I've been doing some experiments evaluating how often we pick
an optimal scan for simple select queries, assuming accurate estimates.
Turns out we pick the wrong plan fairly often, even with almost perfect
estimates. I somewhat expected that, with the default random_page_cost
value. What did surprise me was that to improve the plans, I had to
*increase* the value, even on really new/fast SSDs ...

I started looking at how we calculated the 4.0 default back in 2000.
Unfortunately, there's not a lot of info, as Tom pointed out in 2024 [2].
But he outlined how the experiment worked:

- generate large table (much bigger than RAM)
- measure runtime of seq scan
- measure runtime of full-table index scan
- calculate how much more expensive a random page access is

So I decided to try doing this on a couple of different devices, and see
what random_page_cost value that gives me. Attached is a script doing
such a benchmark:

(1) initializes a new cluster, with a couple parameters adjusted

(2) creates a random table (with uniform distribution)

(3) runs a sequential scan

SELECT * FROM (SELECT * FROM test_table) OFFSET $nrows;

(4) runs an index scan

SELECT * FROM (SELECT * FROM test_table ORDER BY id) OFFSET $nrows;

The script does a couple things to force the query plan, and it reports
timings of the two queries at the end.
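The plan forcing can be sketched as follows. This is a hypothetical reconstruction, not the attached run.sh (which is authoritative); the `benchmark_queries` helper and the specific enable_* settings are my assumptions:

```python
# Hypothetical sketch of the two timed queries. Disabling the competing
# scan types via enable_* GUCs forces each plan shape; OFFSET $nrows makes
# the server read every row while returning nothing to the client.

def benchmark_queries(nrows: int) -> dict[str, str]:
    """Build the SQL for the forced seqscan and forced index scan."""
    return {
        "seqscan": (
            "SET enable_indexscan = off; SET enable_bitmapscan = off; "
            f"SELECT * FROM (SELECT * FROM test_table) OFFSET {nrows};"
        ),
        "indexscan": (
            "SET enable_seqscan = off; SET enable_bitmapscan = off; "
            f"SELECT * FROM (SELECT * FROM test_table ORDER BY id) OFFSET {nrows};"
        ),
    }

queries = benchmark_queries(500_000_000)
```

Each query would then be timed separately (e.g. with \timing in psql), which is all the later calculation needs.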

I've been running this on machines with 64GB of RAM, so I chose the
table to have 500M rows. With fillfactor=20 that means a ~182GB table
(23809524 pages).
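Those numbers are self-consistent; a quick check (my arithmetic, not part of the script):

```python
# Verify the quoted table geometry: 500M rows in 23809524 8kB pages works
# out to ~21 rows per page (due to fillfactor=20) and ~182GB on disk.
ROWS = 500_000_000
PAGES = 23_809_524
BLCKSZ = 8192  # default PostgreSQL block size in bytes

rows_per_page = ROWS / PAGES        # ~21.0
size_gb = PAGES * BLCKSZ / 2**30    # ~181.7
print(round(rows_per_page), round(size_gb))  # 21 182
```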

Let's say that

T(seq) = timing of the seqscan query
T(idx) = timing of the index scan
IOS(seq) = number of sequential page reads
IOS(idx) = number of random page reads
P = number of pages
N = number of rows

then the time to read a sequential page is (because IOS(seq) == P)

PT(seq) = T(seq) / IOS(seq) = T(seq) / P

and the time to read a random page is (assuming the table is perfectly
random, with no cache hits):

PT(idx) = T(idx) / IOS(idx) = T(idx) / N

which gives us the "ideal" random page cost (as a value relative to
reading a page sequentially)

random_page_cost = PT(idx) / PT(seq)

                     T(idx) * P
                 = --------------
                     T(seq) * N
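Plugging the formula into code (the function name is mine) and feeding it the single-NVMe timings from the table below reproduces the ~20 figure:

```python
# random_page_cost = PT(idx) / PT(seq) = (T(idx) * P) / (T(seq) * N),
# per the derivation above.

def random_page_cost(t_seq: float, t_idx: float, pages: int, rows: int) -> float:
    pt_seq = t_seq / pages  # seconds per sequentially-read page
    pt_idx = t_idx / rows   # seconds per randomly-read page (one page per row)
    return pt_idx / pt_seq

# Single NVMe: 98s seqscan, 42232s index scan, 23809524 pages, 500M rows.
print(round(random_page_cost(98, 42232, 23_809_524, 500_000_000), 1))  # 20.5
```

(The table reports 20.4, presumably computed from unrounded timings.)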

The "no cache hits" assumption is not quite correct - with 182GB, about
30% of the pages end up in memory. I didn't think of using
debug_io_direct at the time, but it doesn't affect the conclusion very
much (in fact, it would *increase* the random_page_cost value, which
makes it worse).

I did this on the three SSDs I use for testing - 4x NVMe RAID0, 4x SATA
RAID0, and a single NVMe drive. Here are the results:

                 seqscan (s)   index scan (s)   random_page_cost
   -----------------------------------------------------------------
   NVMe                  98           42232             20.4
   NVMe/RAID0            24           25462             49.3
   SATA/RAID0           109           48141             21.0

These are reasonably good SSDs, and yet the "correct" random_page_cost
comes out at about 5-10x our default.

FWIW I double-checked that the test is actually I/O bound. The CPU usage
never goes over ~50% (not even in the seqscan case).

These calculated values also align with the "optimal" plan choice, i.e.
the plans flip much closer to the actual crossing point (compared to
where it'd flip with 4.0).

It obviously contradicts the advice to set the value closer to 1.0. But
why is that? SSDs certainly are better at random I/O, even if the I/O is
not concurrent and the SSD is not fully utilized. So the 4.0 seems off -
for rotational disks the value should be even higher than what we got
for SSDs ...

I don't have any rotational devices in my test machines anymore, but I
got an Azure VM with a local "SCSI" disk and a "standard HDD" volume.
And I got this (this is with 10M rows, ~3.7GB, with direct I/O):

                 seqscan (s)   index scan (s)   random_page_cost
   -----------------------------------------------------------------
   SCSI                  2.1            1292             28.5
   standard HDD        209.8           27586             62.6

I suspect the SCSI disk is not actually rotational; it seems more like
an SSD behind a SCSI interface (whatever model that might be). The
"standard HDD" seems much closer to rotational, with ~370 IOPS (it'd
take ages to do the index scan on more than 10M rows).
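The ~370 IOPS estimate follows directly from the timings, since the index scan does roughly one random page read per row (my arithmetic):

```python
# "standard HDD" volume: 10M rows, index scan took 27586 seconds.
rows = 10_000_000
t_index_scan = 27_586  # seconds, from the table above

iops = rows / t_index_scan  # ~363 random reads/second - HDD territory
print(round(iops))
```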

Unless I made some silly mistake, these results suggest the current 4.0
value is a bit too low, and something like ~20 would be better even on
SSDs. This is not the first time a higher default was suggested - see
this 2008 post [3]. Of course, that's from before SSDs became a thing;
it's about the evolution of hard disks and our code.

However, it also says this:

Yeah, it seems like raising random_page_cost is not something we
ever recommend in practice. I suspect what we'd really need here to
make any progress is a more detailed cost model, not just fooling
with the parameters of the existing one.

I don't quite follow the reasoning. If increasing the value would
require making the cost model more detailed, why wouldn't the same
apply to lowering it? I don't see a reason for the asymmetry.

Also, I intentionally used a table with "perfectly random" data, because
that's about the simplest thing to estimate, and it indeed makes all the
estimates almost perfect (including the internal ones in cost_index). If
we can't cost such simple cases correctly, what's the point of costing?

From a robustness point of view, wouldn't it be better to actually err
on the side of using a higher random_page_cost value? That'd mean we
flip to "more-sequential" scans sooner, with much "flatter" behavior.
That doesn't need to be a seqscan (which is about as flat as it gets),
but e.g. a bitmap scan - which probably silently "fixes" many cases
where the index scan gets costed too low.

It also says this:

And the value of 4 seems to work well in practice.

I wonder how we know that. Most users don't experiment with different
values very much. They just run with the default, or maybe even lower
it, based on some recommendation. But they don't run the same query with
different values, so they can't spot differences unless they hit a
particularly bad plan.

Of course, it's also true most workloads tend to access well cached
data, which makes errors much cheaper. Or maybe just queries with the
"problematic selectivities" are not that common. Still, even if it
doesn't change the scan choice, it seems important to keep the cost
somewhat closer to reality because of the plan nodes above ...

It seems to me the current default is a bit too low, but changing a GUC
this important is not trivial. So what should we do about it?

regards

[1]: /messages/by-id/14601.949786166@sss.pgh.pa.us

[2]: /messages/by-id/3866858.1728961439@sss.pgh.pa.us

[3]: /messages/by-id/23625.1223642230@sss.pgh.pa.us

--
Tomas Vondra

Attachments:

run.sh (application/x-shellscript)
#2 wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Tomas Vondra (#1)
Re: Should we update the random_page_cost default value?

Hi Tomas

I really can't agree more. Many default values are just too conservative,
and the documentation doesn't provide best practices. I think we should
reduce it to 1.x, or add a tip to the documentation with recommended
values for different SSDs.

On Mon, 6 Oct 2025 at 08:59, Tomas Vondra <tomas@vondra.me> wrote:


#3 David Rowley
dgrowleyml@gmail.com
In reply to: wenhui qiu (#2)
Re: Should we update the random_page_cost default value?

On Mon, 6 Oct 2025 at 17:19, wenhui qiu <qiuwenhuifx@gmail.com> wrote:

I really can't agree more. Many default values are just too conservative, and the documentation doesn't provide best practices.,i think reduce to 1.x,Or add a tip in the document, providing a recommended value for different SSDs.

Did you read Tomas's email, or just the subject line? I think if you're
going to propose moving it in the opposite direction from what Tomas
found to be the more useful one, then that at least warrants providing
some evidence to the contrary of what Tomas has shown, or stating that
you think his methodology for his calculation is flawed because...

I suspect all you've done here is propagate the typical advice people
give out around here. It appears to me that Tomas went to great
lengths to not do that.

David

#4 Pavel Stehule
pavel.stehule@gmail.com
In reply to: David Rowley (#3)
Re: Should we update the random_page_cost default value?

On Mon, Oct 6, 2025 at 6:46, David Rowley <dgrowleyml@gmail.com> wrote:

On Mon, 6 Oct 2025 at 17:19, wenhui qiu <qiuwenhuifx@gmail.com> wrote:

I really can't agree more. Many default values are just too

conservative, and the documentation doesn't provide best practices.,i think
reduce to 1.x,Or add a tip in the document, providing a recommended value
for different SSDs.

Did you read Tomas's email or just the subject line? I think if
you're going to propose to move it in the opposite direction as to
what Tomas found to be the more useful direction, then that at least
warrants providing some evidence to the contrary of what Tomas has
shown or stating that you think his methodology for his calculation is
flawed because...

I suspect all you've done here is propagate the typical advice people
give out around here. It appears to me that Tomas went to great
lengths to not do that.

+1

The problem will be estimating the effect of the cache. It can span a
pretty wide range.

I have access to a not-too-small e-shop in the Czech Republic (though it
is not extra big). It uses today's classic stack - Java (ORM), Elastic,
Postgres. The database size is ca. 1.9TB, shared buffers are 32GB (it
handles about 10-20K logged-in users at a time).

The buffer cache hit ratio is 98.42%. The code is well optimized. This
ratio does not include the file system cache.

I believe that for different applications (OLAP), or for less
well-optimized code, the cache hit ratio can be much, much worse.

Last year I worked with customers that had Postgres in the cloud, and
common (not extra expensive) disks don't have great parameters today. It
is a question whether a single ratio like random_page_cost /
seq_page_cost can describe well the dynamic throttling (the dynamic
behavior of current cloud I/O) where customers frequently hit the limits.
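The cache-hit-ratio point can be made concrete with a toy blended-cost calculation (purely illustrative - this is not how the planner actually models caching; the function and the 0.01 "cached page" cost are my assumptions):

```python
# Toy model: with a high cache hit ratio, the *effective* cost of a
# "random" page access is a blend of a near-free cached read and the
# full random_page_cost miss penalty.

def effective_random_cost(rpc: float, hit_ratio: float,
                          cached_cost: float = 0.01) -> float:
    return hit_ratio * cached_cost + (1 - hit_ratio) * rpc

# At the 98.42% hit ratio mentioned above, even rpc = 20 blends way down:
print(round(effective_random_cost(20.0, 0.9842), 2))  # 0.33
```

Which may be part of why well-cached OLTP workloads rarely notice the raw value.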

Regards

Pavel


#5 David Rowley
dgrowleyml@gmail.com
In reply to: Tomas Vondra (#1)
Re: Should we update the random_page_cost default value?

On Mon, 6 Oct 2025 at 13:59, Tomas Vondra <tomas@vondra.me> wrote:

Unless I did some silly mistakes, these results suggest the current 4.0
value is a bit too low, and something like ~20 would be better even on
SSDs. This is not the first time it was suggested a higher default might
be better - see this 2008 post [3]. Of course, that's from before SSDs
became a thing, it's about evolution in hard disks and our code.

Thanks for going to all that effort to calculate that. It was an
interesting read, and also very interesting that you found the opposite
of the advice that people typically provide.

I don't have any HDDs around to run the script to check the results. I
do have a machine with 2 SSDs, one PCIe3 and one PCIe4. I'll see if I
can get you some results from that.

It would be interesting to see how your calculated values fare when
given a more realistic workload. Say TPC-H or Join Order Benchmark. I
recall that TPC-H has both a power and a throughput result, one to test
concurrency and one for single query throughput. I wonder if the same
random_page_cost setting would be preferred in both scenarios. I can
imagine that it might be more useful to have more index pages in
shared buffers when there's strong contention for buffers. It would be
interesting to run some pg_buffercache queries with GROUP BY relkind
to see how much of an effect changing random_page_cost has on the
number of buffers per relkind after each query.

I wonder if the OtterTune people collected any "phone-home"
information feeding back about what the software picked for GUCs. It
would be interesting to know if there was some trend to show what the
best random_page_cost setting was or if the best setting varied based
on the server and workload.

David

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#1)
Re: Should we update the random_page_cost default value?

Tomas Vondra <tomas@vondra.me> writes:

I wonder if it's time to consider updating the random_page_cost default
value. There are multiple reasons to maybe do that.

Re-reading the old links you provided, I was reminded that the 4.0
value came in with some fairly fundamental changes to the cost model,
such as accounting for WHERE-clause evaluation explicitly instead of
assuming that it was down in the noise. I can't help wondering if
it's time for another rethink of what the cost model is, rather than
just messing with its coefficients. I don't have any concrete ideas
about what that should look like, though.

Another angle is that I expect that the ongoing AIO work will largely
destroy the existing model altogether, at least if you think in terms
of the model as trying to predict query execution time. But if what
we're trying to model is net resource demands, with an eye to
minimizing the total system load not execution time of any one query,
maybe we can continue to work with something close to what we've
traditionally done.

No answers here, just more questions ...

regards, tom lane

#7 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Tomas Vondra (#1)
Re: Should we update the random_page_cost default value?

On Mon, 2025-10-06 at 02:59 +0200, Tomas Vondra wrote:

I wonder if it's time to consider updating the random_page_cost default
value. There are multiple reasons to maybe do that.

[experiments]

Unless I did some silly mistakes, these results suggest the current 4.0
value is a bit too low, and something like ~20 would be better even on
SSDs.

[...]

It seems to me the current default is a bit too low, but changing a GUC
this important is not trivial. So what should we do about it?

I have no reason to doubt your experiments and your reasoning.

However, my practical experience is that PostgreSQL tends to favor
sequential scans too much. Often that happens together with a parallel
plan: I find PostgreSQL with the default configuration preferring a plan
with two parallel workers performing a sequential scan with a
ridiculously selective (correctly estimated!) filter condition, like 500
rows out of a million, over an index scan that is demonstrably faster.

I have no artificial reproducer for that, and I admit that I didn't hunt
down the reason why the planner might prefer such a plan. I just tell
people to lower random_page_cost, and the problem goes away - so I am
clearly fighting symptoms. Often, an alternative solution is to set
max_parallel_workers_per_gather to 0, which seems to suggest that perhaps
the suggestion from [1] is more interesting than I thought.

Anyway, I cannot remember ever having been in a situation where
PostgreSQL prefers a slow index scan and I had to raise random_page_cost
to get a faster sequential scan. Perhaps that is because slow index
scans are often not drastically slower, or perhaps I deal too little
with purely analytical queries.

Again, this doesn't invalidate any of what you said. I just wanted to
share my experiences.

Yours,
Laurenz Albe

[1]: /messages/by-id/a5916f83-de79-4a40-933a-fb0d9ba2f5a0@app.fastmail.com

#8 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Tom Lane (#6)
Re: Should we update the random_page_cost default value?

On Mon, 2025-10-06 at 01:29 -0400, Tom Lane wrote:

But if what
we're trying to model is net resource demands, with an eye to
minimizing the total system load not execution time of any one query,
maybe we can continue to work with something close to what we've
traditionally done.

Did anybody propose that?
I was under the impression that PostgreSQL's cost model tries to
estimate and optimize execution time, not resource consumption.
Changing that would be pretty radical. For example, it would be
quite obvious that we'd have to disable parallel query by default.

But perhaps I misunderstood, or perhaps I am just too conservative.

Yours,
Laurenz Albe

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Laurenz Albe (#8)
Re: Should we update the random_page_cost default value?

Laurenz Albe <laurenz.albe@cybertec.at> writes:

On Mon, 2025-10-06 at 01:29 -0400, Tom Lane wrote:

But if what
we're trying to model is net resource demands, with an eye to
minimizing the total system load not execution time of any one query,
maybe we can continue to work with something close to what we've
traditionally done.

Did anybody propose that?

I just did ;-). If we don't adopt a mindset along that line,
then AIO is going to require some *radical* changes in the
planner's I/O cost models.

I was under the impression that PostgreSQL's cost model tries to
estimate and optimize execution time, not resource consumption.

Yup, that's our traditional view of it. But I wonder how we
will make such estimates in a parallel-I/O world, especially
if we don't try to account for concurrent query activity.
(Which is a place I don't want to go, because it would render
planning results utterly irreproducible.)

But perhaps I misunderstood, or perhaps I am just too conservative.

I'm normally pretty conservative also about changing planner
behavior. But in this context I think we need to be wary of
thinking too small. The fact that people keep coming out with
different ideas of what random_page_cost needs to be suggests
that there's something fundamentally wrong with the concept.

regards, tom lane

#10 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Pavel Stehule (#4)
Re: Should we update the random_page_cost default value?

On 10/6/25 07:25, Pavel Stehule wrote:

On Mon, Oct 6, 2025 at 6:46, David Rowley <dgrowleyml@gmail.com> wrote:

On Mon, 6 Oct 2025 at 17:19, wenhui qiu <qiuwenhuifx@gmail.com> wrote:

I really can't agree more. Many default values are just too

conservative, and the documentation doesn't provide best
practices.,i think reduce to 1.x,Or add a tip in the document,
providing a recommended value for different SSDs.

Did you read Tomas's email or just the subject line?  I think if
you're going to propose to move it in the opposite direction as to
what Tomas found to be the more useful direction, then that at least
warrants providing some evidence to the contrary of what Tomas has
shown or stating that you think his methodology for his calculation is
flawed because...

I suspect all you've done here is propagate the typical advice people
give out around here.  It appears to me that Tomas went to great
lengths to not do that.

+1

The problem will be in estimation of the effect of cache. It can be
pretty wide range.

I have a access to not too small eshop in Czech Republic (but it is not
extra big) - It uses today classic stack - Java (ORM), Elastic,
Postgres. The database size is cca 1.9T, shared buffers are 32GB (it
handles about 10-20K logged users at one time). 

The buffer cache hit ratio is 98.42%. The code is well optimized. This
ratio is not calculated with file system cache.

I believe so for different applications  (OLAP) or less well optimized
the cache hit ratio can be much much worse.

Last year I had an experience with customers that had Postgres in
clouds, and common (not extra expensive) discs are not great parameters
today. It is a question if one ratio like random page cost / seq page
cost can well describe dynamic throttling (or dynamic behavior of
current clouds io) where customers frequently touch limits.

Perhaps. Estimating cache effects is hard, no argument about that.

The estimation works by assuming no cache, and then adjusting that based
on some rough approximation of cache effects. If we can't get the first
step sufficiently close to reality, there's no chance of getting a good
final estimate.

The test queries were intentionally constructed (data size, randomness)
to make caching mostly irrelevant - both for estimation and execution.

regards

--
Tomas Vondra

#11 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Rowley (#5)
Re: Should we update the random_page_cost default value?

On 10/6/25 07:26, David Rowley wrote:

On Mon, 6 Oct 2025 at 13:59, Tomas Vondra <tomas@vondra.me> wrote:

Unless I did some silly mistakes, these results suggest the current 4.0
value is a bit too low, and something like ~20 would be better even on
SSDs. This is not the first time it was suggested a higher default might
be better - see this 2008 post [3]. Of course, that's from before SSDs
became a thing, it's about evolution in hard disks and our code.

Thanks for going to all that effort to calculate that. It was an
interesting read and also very interesting that you found the opposite
to the typical advice that people typically provide.

I don't have any HDDs around to run the script to check the results. I
do have a machine with 2 SSDs, one PCIe3 and one PCIe4. I'll see if I
can get you some results from that.

Yeah, that'd be interesting. I don't think it'll be massively different
from the data I collected, but more data is good.

FWIW I suggest modifying the script to use "debug_io_direct = data" and
smaller shared_buffers. That allows using much smaller data sets (and
thus faster runs).

It would be interesting to see how your calculated values fare when
given a more realistic workload. Say TPC-H or Join Order Benchmark. I
recall that TPCH has both a power and a throughput result, one to test
concurrency and one for single query throughput. I wonder if the same
random_page_cost setting would be preferred in both scenarios. I can
imagine that it might be more useful to have more index pages in
shared buffers when there's strong contention for buffers. It would be
interesting to run some pg_buffercache queries with GROUP BY relkind
to see how much of an effect changing random_page_cost has on the
number of buffers per relkind after each query.

Good idea, I'll give TPC-H a try soon. My concern is that for complex
queries it's much harder to pinpoint the problem, and an estimation
error may sometimes compensate for (or amplify) an earlier one. Worth a try.

As for the concurrency, I don't have a great answer. But perhaps it's
related to Tom's point about AIO. I mean, AIO also turns serial IOs to
concurrent ones, so maybe it's similar to multiple concurrent queries?

I wonder if the OtterTune people collected any "phone-home"
information feeding back about what the software picked for GUCs. It
would be interesting to know if there was some trend to show what the
best random_page_cost setting was or if the best setting varied based
on the server and workload.

No idea, and given OT is gone I doubt we'd get any data.

regards

--
Tomas Vondra

#12 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Tom Lane (#9)
Re: Should we update the random_page_cost default value?

On Mon, 2025-10-06 at 01:53 -0400, Tom Lane wrote:

Laurenz Albe <laurenz.albe@cybertec.at> writes:

On Mon, 2025-10-06 at 01:29 -0400, Tom Lane wrote:

But if what
we're trying to model is net resource demands, with an eye to
minimizing the total system load not execution time of any one query,
maybe we can continue to work with something close to what we've
traditionally done.

Did anybody propose that?

I just did ;-).  If we don't adopt a mindset along that line,
then AIO is going to require some *radical* changes in the
planner's I/O cost models.

I see your point, and actually the idea of the planner targeting
the lowest resource usage is quite attractive. That is, in a
situation where you want to optimize for throughput. I regularly
find myself advising users that if their CPU load is approaching
100%, they had better disable parallel query.

But I am afraid that that would pessimize plans for analytical
queries, where your sole goal is a low response time.

This is far from a serious proposal, but perhaps there could be
a parameter "optimizer_goal" with values "throughput", "response_time"
and "mixed" that determines the default value for other parameters...

Yours,
Laurenz Albe

#13Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Laurenz Albe (#7)
Re: Should we update the random_page_cost default value?

On 10/6/25 07:34, Laurenz Albe wrote:

On Mon, 2025-10-06 at 02:59 +0200, Tomas Vondra wrote:

I wonder if it's time to consider updating the random_page_cost default
value. There are multiple reasons to maybe do that.

[experiments]

Unless I made some silly mistakes, these results suggest the current 4.0
value is a bit too low, and something like ~20 would be better even on
SSDs.

[...]

It seems to me the current default is a bit too low, but changing a GUC
this important is not trivial. So what should we do about it?

I have no reason to doubt your experiments and your reasoning.

However, my practical experience is that PostgreSQL tends to favor
sequential scans too much. Often, that happens together with a parallel
plan, and I find that PostgreSQL with the default configuration prefers
a plan with two parallel workers performing a sequential scan with a
ridiculously selective (correctly estimated!) filter condition, like 500
rows out of a million, over an index scan that is demonstrably faster.

I have no artificial reproducer for that, and I admit that I didn't hunt
down the reason why the planner might prefer such a plan. I just tell
people to lower random_page_cost, and the problem goes away. So I am
clearly fighting symptoms. Often, an alternative solution is to set
max_parallel_workers_per_gather to 0, which seems to suggest that perhaps
the suggestion from [1] is more interesting than I thought.

I don't doubt your experience, but to draw any conclusions from this
report we'd need to understand why it happens. I suspect there's some
estimation issue (likely unrelated to random_page_cost), or some effect
our cost model does not / can't consider. And lowering random_page_cost
simply compensates for the earlier error.

Anyway, I cannot remember ever having been in a situation where PostgreSQL
prefers a slow index scan, and I had to raise random_page_cost to get a
faster sequential scan. Perhaps that is because slow index scans are
often not drastically slower, or perhaps I deal too little with purely
analytical queries.

I mentioned a couple factors mitigating the worst effects at the end of
my initial message.

The "range" where we cost seqscan and indexscan the wrong way is not
that wide. In the NVMe RAID0 case we should flip at 0.1% selectivity,
but we flip (based on cost) at ~1.5%, at which point the index scan is
10x more expensive than the seq scan.

Maybe people don't run queries in this range too often? On the 500M-row
table it's ~5M rows, which for OLTP seems like a lot and for OLAP is
probably too little. Real data is likely more correlated (and thus
benefits from caching).

And of course, we don't just flip between seqscan/indexscan, we also
have bitmapscan, and that behaves much more reasonably.
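For intuition, the break-even selectivity in a toy version of this model
(grossly simplified, ignoring CPU costs, index pages, correlation and
caching; the helper name and table sizes are mine, purely illustrative):

```python
def breakeven_selectivity(table_pages, table_rows,
                          seq_page_cost=1.0, random_page_cost=4.0):
    # Crude model: a seq scan reads every page once; an uncorrelated
    # index scan pays one random page fetch per matching row. The
    # break-even point is where the two costs are equal.
    seq_cost = table_pages * seq_page_cost
    return seq_cost / (table_rows * random_page_cost)

# e.g. 500M rows at ~60 rows/page -> ~8.3M pages:
print(breakeven_selectivity(8_300_000, 500_000_000))  # ~0.004, i.e. ~0.4%
```

With these (made-up) sizes the crossover sits below 1%, which is why
even small shifts in random_page_cost move the flip point noticeably.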

regards

--
Tomas Vondra

#14Michael Banck
michael.banck@credativ.de
In reply to: Tomas Vondra (#1)
Re: Should we update the random_page_cost default value?

Hi,

On Mon, Oct 06, 2025 at 02:59:16AM +0200, Tomas Vondra wrote:

I started looking at how we calculated the 4.0 default back in 2000.
Unfortunately, there's not a lot of info, as Tom pointed out in 2024 [2].
But he outlined how the experiment worked:

- generate large table (much bigger than RAM)
- measure runtime of seq scan
- measure runtime of full-table index scan
- calculate how much more expensive a random page access is
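The last step of the quoted procedure is just a ratio of per-page times.
As a sketch (the helper is hypothetical and all numbers are purely
illustrative, not measured):

```python
def estimate_random_page_cost(seq_seconds, seq_pages,
                              idx_seconds, idx_page_fetches):
    """Derive random_page_cost as the ratio of per-page access times.

    seq_pages: pages read by the sequential scan (the table size)
    idx_page_fetches: heap page fetches done by the index scan
    (close to one per row for a poorly correlated index).
    """
    seq_cost_per_page = seq_seconds / seq_pages
    rand_cost_per_page = idx_seconds / idx_page_fetches
    return rand_cost_per_page / seq_cost_per_page

# Illustrative only: a 100s seq scan over 1M pages vs a 4000s index
# scan doing 10M random heap fetches.
print(round(estimate_random_page_cost(100.0, 1_000_000,
                                      4000.0, 10_000_000), 2))  # prints 4.0
```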

Ok, but I also read somewhere (I think it might have been Bruce in a
recent (last few years) discussion of random_page_cost) that on top of
that, we assumed 90% (or was it 95%?) of the queries were cached in
shared_buffers (preferably the indexes), so that random access being
massively slower than sequential access (surely more than 4x back in
2000) is offset by that. I only quickly read your mail, but I didn't see
any discussion of caching at first glance, or do you think it does not
matter much?

Michael

#15Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Banck (#14)
Re: Should we update the random_page_cost default value?

On 10/6/25 11:02, Michael Banck wrote:

Hi,

On Mon, Oct 06, 2025 at 02:59:16AM +0200, Tomas Vondra wrote:

I started looking at how we calculated the 4.0 default back in 2000.
Unfortunately, there's not a lot of info, as Tom pointed out in 2024 [2].
But he outlined how the experiment worked:

- generate large table (much bigger than RAM)
- measure runtime of seq scan
- measure runtime of full-table index scan
- calculate how much more expensive a random page access is

Ok, but I also read somewhere (I think it might have been Bruce in a
recent (last few years) discussion of random_page_cost) that on top of
that, we assumed 90% (or was it 95%?) of the queries were cached in
shared_buffers (preferably the indexes), so that random access being
massively slower than sequential access (surely more than 4x back in
2000) is offset by that. I only quickly read your mail, but I didn't see
any discussion of caching at first glance, or do you think it does not
matter much?

I think you're referring to this:

/messages/by-id/1156772.1730397196@sss.pgh.pa.us

As Tom points out, that's not really how we calculated the 4.0 default.
We should probably remove that from the docs.

regards

--
Tomas Vondra

#16Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#6)
Re: Should we update the random_page_cost default value?

On 10/6/25 07:29, Tom Lane wrote:

Tomas Vondra <tomas@vondra.me> writes:

I wonder if it's time to consider updating the random_page_cost default
value. There are multiple reasons to maybe do that.

Re-reading the old links you provided, I was reminded that the 4.0
value came in with some fairly fundamental changes to the cost model,
such as accounting for WHERE-clause evaluation explicitly instead of
assuming that it was down in the noise. I can't help wondering if
it's time for another rethink of what the cost model is, rather than
just messing with its coefficients. I don't have any concrete ideas
about what that should look like, though.

True, maybe it's time for a larger revision. Do you have any thoughts on
how it should be changed? (I don't, but I didn't have the ambition.)

Another angle is that I expect that the ongoing AIO work will largely
destroy the existing model altogether, at least if you think in terms
of the model as trying to predict query execution time. But if what
we're trying to model is net resource demands, with an eye to
minimizing the total system load not execution time of any one query,
maybe we can continue to work with something close to what we've
traditionally done.

No answers here, just more questions ...

I had the same thought, when working on the (index) prefetching. Which
of course now relies on AIO. Without concurrency, there wasn't much
difference between optimizing for resources and time, but AIO changes
that. In fact, parallel query has a similar effect, because it also
spreads the work to multiple concurrent processes.

Parallel query simply divides the cost between workers, as if each used
a fraction of the resources. And the cost of the parallel plan is lower
than the sum of the per-worker costs. Maybe AIO should do something
similar? That is, estimate the I/O concurrency and lower the cost a bit?
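For reference, the way parallel costing divides work today looks roughly
like this. This is my recollection of get_parallel_divisor() in
costsize.c; treat the coefficients as approximate:

```python
def parallel_divisor(workers, leader_participates=True):
    # Each worker contributes a full share of work; the leader
    # contributes less and less as it spends more time servicing
    # workers, and nothing at all from 4 workers up.
    divisor = float(workers)
    if leader_participates:
        leader_contribution = 1.0 - 0.3 * workers
        if leader_contribution > 0:
            divisor += leader_contribution
    return divisor

# With 2 workers the leader still does 40% of a share, so the run
# cost is divided by 2.4 rather than 3.
print(parallel_divisor(2))  # prints 2.4
```

An AIO analogue could divide (part of) the IO cost by an estimated
number of in-flight IOs in a similar way.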

regards

--
Tomas Vondra

#17Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Tomas Vondra (#16)
Re: Should we update the random_page_cost default value?

On Mon, Oct 6, 2025 at 11:24 AM Tomas Vondra <tomas@vondra.me> wrote:

On 10/6/25 07:29, Tom Lane wrote:

Tomas Vondra <tomas@vondra.me> writes:

[..]

Another angle is that I expect that the ongoing AIO work will largely
destroy the existing model altogether, at least if you think in terms
of the model as trying to predict query execution time.

[..]

That is, estimate the I/O concurrency and lower the cost a bit?

Side question, out of curiosity: didn't that ship already sail with the
introduction of the streaming read API a while ago? After all,
io_combine_limit with vectored preadv() has the ability to greatly
accelerate seq scans (batching up to 16 syscalls while the kernel is
doing its own magic in the background anyway, with
merging/splitting/readahead). To me it looks like you are seeing heavy
concurrency benefits at least on that `NVMe/RAID0` testcase, so one
question would be: was the default random_page_cost also that inaccurate
earlier, e.g. on PG16?

-J.

#18Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#1)
Re: Should we update the random_page_cost default value?

Hi,

On 2025-10-06 02:59:16 +0200, Tomas Vondra wrote:

So I decided to try doing this on a couple different devices, and see
what random_page_cost values that gives me. Attached is a script doing
such benchmark:

(1) initializes a new cluster, with a couple parameters adjusted

(2) creates a random table (with uniform distribution)

(3) runs a sequential scan

SELECT * FROM (SELECT * FROM test_table) OFFSET $nrows;

(4) runs an index scan

SELECT * FROM (SELECT * FROM test_table ORDER BY id) OFFSET $nrows;

Why compare an unordered with an ordered scan? ISTM that if you want to
actually compare the cost of two different approaches to the same query, you'd
not want to actually change what the query does? Yes, you say that CPU time
never gets above 50%, but 50% isn't nothing.

It also seems that due to the ordering inside the table (the ORDER BY
random() during table creation), you're going to see vastly different
numbers of page accesses. While that's obviously something worth taking
into account for planning purposes, I don't think it'd properly be done
via random_page_cost itself.

I think doing this kind of measurement via normal SQL query processing
is almost always going to have too many other influences. I'd measure
using fio or such instead. It'd be interesting to see fio numbers for
your disks...

fio --directory /srv/fio --size=8GiB --name test --invalidate=0 --bs=$((8*1024)) --rw read --buffered 0 --time_based=1 --runtime=5 --ioengine pvsync --iodepth 1
vs --rw randread

gives me 51k/11k for sequential/rand on one SSD and 92k/8.7k for another.
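Taken at face value, those fio numbers already imply cost ratios well
above 4.0. Naive back-of-the-envelope arithmetic, ignoring caching,
concurrency and CPU costs (the helper name is mine):

```python
def implied_ratio(seq_iops, rand_iops):
    # With identical 8kB blocks, per-page cost is inversely
    # proportional to IOPS, so the IOPS ratio is the ratio of
    # per-page costs.
    return seq_iops / rand_iops

print(round(implied_ratio(51_000, 11_000), 1))  # prints 4.6
print(round(implied_ratio(92_000, 8_700), 1))   # prints 10.6
```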

It obviously contradicts the advice to set the value closer to 1.0. But
why is that? SSDs are certainly better with random I/O, even if the I/O
is not concurrent and the SSD is not fully utilized. So the 4.0 seems
off, the value should be higher than what we got for SSDs ...

I'd guess that the *vast* majority of PG workloads these days run on
networked block storage. For those, the actual latency at the storage
level is typically a rather small fraction of the overall IO latency,
which is instead dominated by network and other related costs (like the
indirection to decide which storage system to go to, and crossing
VM/host boundaries). Because the majority of the IO latency is
determined by network latency rather than storage latency, the random
IO/non-random IO difference plays less of a role.

From a robustness point of view, wouldn't it be better to actually err
on the side of using a higher random_page_cost value? That'd mean we
flip to "more-sequential" scans sooner, with much "flatter" behavior.
That doesn't need to be a seqscan (which is about as flat as it gets),
but e.g. a bitmap scan - which probably silently "fixes" many cases
where the index scan gets costed too low.

I think it's often the exact opposite - folks use a lower random page cost to
*prevent* the planner from going to sequential (or bitmap heap) scans. In many
real-world queries our selectivity estimates aren't great and the performance
penalties of switching from an index scan to a sequential scan are really
severe. As you note, this is heavily exascerbated by the hot data often being
cached, but cold data not. Obviously the seqscan will process the cold data
too.

Greetings,

Andres Freund

#19Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#9)
Re: Should we update the random_page_cost default value?

Hi,

On 2025-10-06 01:53:05 -0400, Tom Lane wrote:

Laurenz Albe <laurenz.albe@cybertec.at> writes:

On Mon, 2025-10-06 at 01:29 -0400, Tom Lane wrote:

But if what
we're trying to model is net resource demands, with an eye to
minimizing the total system load not execution time of any one query,
maybe we can continue to work with something close to what we've
traditionally done.

Did anybody propose that?

I just did ;-). If we don't adopt a mindset along that line,
then AIO is going to require some *radical* changes in the
planner's I/O cost models.

FWIW, I think some vaguely AIO-related cost concepts ought to have been
taken into account long ago, but weren't. I'm not trying to say that we
don't need to do work, but the asynchronicity-related issues are bigger
than explicit asynchronous IO.

E.g., despite there often being a >10x performance difference between a
forward and a backward index scan when the index and table order are
well correlated, we don't cost them any differently. The reason for that
is that the OS will do readahead for forward scans, but not for backward
scans.

Here's an example EXPLAIN ANALYZE on a disk with an artificial 1ms
latency (to simulate networked storage):

echo 3 |sudo tee /proc/sys/vm/drop_caches && psql -Xq -f ~/tmp/evict.sql && \
/usr/bin/time -f '%e' psql -Xq -c 'explain analyze SELECT * FROM aiobench ORDER BY serial_val ASC LIMIT 1 OFFSET 100000;'

QUERY PLAN >
-------------------------------------------------------------------------------------------------------------------------------------------------------------->
Limit (cost=4455.58..4455.62 rows=1 width=120) (actual time=49.820..49.821 rows=1.00 loops=1)
Buffers: shared hit=553 read=2145
I/O Timings: shared read=36.096
-> Index Scan using aiobench_serial_val_idx on aiobench (cost=0.57..8908138.12 rows=199957824 width=120) (actual time=2.971..47.261 rows=100001.00 loops=>
Index Searches: 1
Buffers: shared hit=553 read=2145
I/O Timings: shared read=36.096
Planning:
Buffers: shared hit=113 read=23
I/O Timings: shared read=10.618
Planning Time: 10.981 ms
Execution Time: 49.838 ms
(12 rows)

echo 3 |sudo tee /proc/sys/vm/drop_caches && psql -Xq -f ~/tmp/evict.sql && \
/usr/bin/time -f '%e' psql -Xq -c 'explain analyze SELECT * FROM aiobench ORDER BY serial_val DESC LIMIT 1 OFFSET 100000;'
QUERY PLAN >
-------------------------------------------------------------------------------------------------------------------------------------------------------------->
Limit (cost=4455.58..4455.62 rows=1 width=120) (actual time=2138.047..2138.049 rows=1.00 loops=1)
Buffers: shared hit=739 read=2138
I/O Timings: shared read=2123.844
-> Index Scan Backward using aiobench_serial_val_idx on aiobench (cost=0.57..8908138.12 rows=199957824 width=120) (actual time=4.912..2135.519 rows=10000>
Index Searches: 1
Buffers: shared hit=739 read=2138
I/O Timings: shared read=2123.844
Planning:
Buffers: shared hit=112 read=23
I/O Timings: shared read=10.446
Planning Time: 10.816 ms
Execution Time: 2138.067 ms
(12 rows)

36ms vs 2123ms in IO time, with the same cost.

Of course these queries *do* something different, but we often choose ordered
index scans as a way of implementing a query (e.g. as part of fast start
plans), where there are other alternatives of implementing the same query.

Another example of bad costing around this is that we do not take
effective_io_concurrency into account when planning bitmap heap scans,
despite it often making a *massive* difference in whether a bitmap heap
scan is a good choice or a bad one. E.g. on the system with the
simulated 1ms IO latency, the difference between 1 and 64 is vast.

eic=1:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=606243.51..606243.52 rows=1 width=8) (actual time=184218.775..184218.776 rows=1.00 loops=1)
Buffers: shared hit=2 read=194846
I/O Timings: shared read=183913.048
-> Bitmap Heap Scan on aiobench (cost=3900.18..605784.10 rows=183767 width=8) (actual time=79.811..184202.181 rows=199465.00 loops=1)
Recheck Cond: ((random_val >= '0.001'::double precision) AND (random_val <= '0.002'::double precision))
Heap Blocks: exact=194299
Buffers: shared hit=2 read=194846
I/O Timings: shared read=183913.048
-> Bitmap Index Scan on aiobench_random_val_idx (cost=0.00..3854.24 rows=183767 width=0) (actual time=37.287..37.288 rows=199465.00 loops=1)
Index Cond: ((random_val >= '0.001'::double precision) AND (random_val <= '0.002'::double precision))
Index Searches: 1
Buffers: shared hit=2 read=547
I/O Timings: shared read=4.972
Planning:
Buffers: shared hit=76 read=24
I/O Timings: shared read=11.571
Planning Time: 12.953 ms
Execution Time: 184218.986 ms

eic=64:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=606243.51..606243.52 rows=1 width=8) (actual time=2962.965..2962.966 rows=1.00 loops=1)
Buffers: shared hit=2 read=194846
I/O Timings: shared read=2316.070
-> Bitmap Heap Scan on aiobench (cost=3900.18..605784.10 rows=183767 width=8) (actual time=82.871..2947.892 rows=199465.00 loops=1)
Recheck Cond: ((random_val >= '0.001'::double precision) AND (random_val <= '0.002'::double precision))
Heap Blocks: exact=194299
Buffers: shared hit=2 read=194846
I/O Timings: shared read=2316.070
-> Bitmap Index Scan on aiobench_random_val_idx (cost=0.00..3854.24 rows=183767 width=0) (actual time=38.526..38.526 rows=199465.00 loops=1)
Index Cond: ((random_val >= '0.001'::double precision) AND (random_val <= '0.002'::double precision))
Index Searches: 1
Buffers: shared hit=2 read=547
I/O Timings: shared read=5.021
Planning:
Buffers: shared hit=76 read=24
I/O Timings: shared read=11.485
Planning Time: 12.890 ms
Execution Time: 2963.154 ms
(18 rows)

A ~60x query execution difference that's not at all represented in the cost
model. The former makes a bitmap heap scan a terrible choice, the latter a
good one...
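For what it's worth, those two runtimes match a trivial latency model
almost exactly (a sanity check using the read counts from the plans
above; the helper is just my notation):

```python
def io_seconds(reads, latency_s, concurrency):
    # If 'concurrency' IOs are kept in flight, total IO wait is
    # roughly reads * latency / concurrency.
    return reads * latency_s / concurrency

reads = 194846          # heap pages read by the bitmap heap scan above
latency = 0.001         # the simulated 1ms per IO

print(io_seconds(reads, latency, 1))    # ~195s, close to the observed 184s
print(io_seconds(reads, latency, 64))   # ~3.0s, close to the observed 2.96s
```

So nearly all of the 60x gap is explained by latency hiding, which the
cost model simply never sees.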

I was under the impression that PostgreSQL's cost model tries to
estimate and optimize execution time, not resource consumption.

Yup, that's our traditional view of it. But I wonder how we
will make such estimates in a parallel-I/O world, especially
if we don't try to account for concurrent query activity.
(Which is a place I don't want to go, because it would render
planning results utterly irreproducible.)

Another complicated aspect around this is that we don't take caching
into account in any way that reflects the real world. In common
workloads the hotly accessed data is all in shared_buffers, but the cold
portion of the data is not. In such scenarios switching e.g. from an
index scan to a sequential scan will be way worse than in a scenario
where all the data is cold, since it'll likely eat up a good portion of
the IO bandwidth and possibly pollute the buffer pool.

At the same time, leaving aside the difficulty of estimating it, making
plans depend on the current state of the buffer pool has some fairly
obvious, and massive, downsides. Like unstable plans, and never actually
ending up in the "high performance" situation after a restart, due to
choosing plans that just work for an empty buffer pool.

I do not have the *slightest* idea for how to improve the situation around
this, even though I think it's fairly important.

But perhaps I misunderstood, or perhaps I am just too conservative.

I'm normally pretty conservative also about changing planner
behavior. But in this context I think we need to be wary of
thinking too small. The fact that people keep coming out with
different ideas of what random_page_cost needs to be suggests
that there's something fundamentally wrong with the concept.

I think one of the big issues with random_page_cost is that it combines
two largely independent things:

1) the increased cost of doing a random IO over a sequential IO
2) that random IOs very often are synchronous and hard to predict

But we have had support for doing readahead for some random IO for a
long time (the posix_fadvise() calls within bitmap heap scans), just
without taking that into account from a costing POV.

I suspect that we'll continue to need to somehow distinguish between
random/non-random IO, the differences are simply too large at the storage
level to ignore.

But we need to add accounting for whether IO is synchronous or not. If
you have a bunch of random IOs, but you know them ahead of time cheaply
(say in a bitmap heap scan), there should be a different cost for the
query than if there are a bunch of random IOs that we cannot
realistically predict (say the IO for inner index pages in an ordered
index scan, or all accesses as part of an index nested loop where the
inner side is unique).

Unfortunately that will probably make it more problematic that we aren't
modeling resource consumption - costing a query that does 10x as many,
but prefetchable, IOs below an ever so slightly more expensive query is
probably not a tradeoff most want to make.
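To sketch the strawman: a cost function distinguishing the two
components might look something like this. Entirely hypothetical - the
function, its parameters and the coefficient choices are mine, not a
proposal:

```python
def page_io_cost(pages, random=True, predictable=False,
                 seq_page_cost=1.0, random_page_cost=4.0,
                 io_concurrency=16):
    """Hypothetical: separate 'random placement' from 'synchronous'.

    Predictable random IO (e.g. bitmap heap scan) can be issued
    ahead of time, so the synchronous-wait component is amortized
    over io_concurrency in-flight requests; unpredictable random
    IO pays the full price.
    """
    if not random:
        return pages * seq_page_cost
    per_page = random_page_cost
    if predictable:
        per_page = seq_page_cost + \
            (random_page_cost - seq_page_cost) / io_concurrency
    return pages * per_page

# Same 1000 random pages, very different costs depending on whether
# we could know them in advance:
print(page_io_cost(1000, predictable=False))  # prints 4000.0
print(page_io_cost(1000, predictable=True))   # prints 1187.5
```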

Greetings,

Andres Freund

#20Andres Freund
andres@anarazel.de
In reply to: Laurenz Albe (#7)
Re: Should we update the random_page_cost default value?

Hi,

On 2025-10-06 07:34:53 +0200, Laurenz Albe wrote:

However, my practical experience is that PostgreSQL tends to favor
sequential scans too much. Often, that happens together with a parallel
plan, and I find that PostgreSQL with the default configuration prefers
a plan with two parallel workers performing a sequential scan with a
ridiculously selective (correctly estimated!) filter condition, like 500
rows out of a million, over an index scan that is demonstrably faster.

I have no artificial reproducer for that, and I admit that I didn't hunt
down the reason why the planner might prefer such a plan. I just tell
people to lower random_page_cost, and the problem goes away. So I am
clearly fighting symptoms. Often, an alternative solution is to set
max_parallel_workers_per_gather to 0, which seems to suggest that perhaps
the suggestion from [1] is more interesting than I thought.

I've seen this quite often too. IIRC in the cases I've actually analyzed in
any depth it came down to a few root causes:

1) Fast-start plans (where the planner thought that a sequential scan
would find a matching tuple quickly, but it doesn't, leading to scanning
most of the table). I frankly think we should just disable these;
they're a very low-confidence bet with high costs in the case of a loss.

2) Not taking the likelihood of data already being cached into account leads
to preferring sequential scans due to seq_page_cost, even though the index
scan would not have required any IO

3) Our costing for the cost of predicate evaluation is extremely poor. Among
the reasons are

- There is no difference in cost between common operators, despite
significant differences in real evaluation cost. E.g. int and text
operators are not close in evaluation cost.

- IIRC we completely disregard the cost of potentially needing to
detoast, despite that very easily becoming the determining factor.

- Tuple deforming cost. It's a lot more CPU intensive to deform column
105 than column 5; often the index might avoid the more expensive
deforming, but we don't take that into account.

This often leads to under-estimating the CPU cost of seqscans.

Greetings,

Andres Freund

#21Bruce Momjian
bruce@momjian.us
In reply to: Tomas Vondra (#15)
#22Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#18)
#23Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#22)
#24Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#23)
#25Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#24)
#26Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#18)
#27Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#26)
#28Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#27)
#29Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#28)
#30Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#29)
#31Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#30)
#32Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#30)
#33Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Rowley (#5)
#34Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#31)
#35Robert Treat
xzilla@users.sourceforge.net
In reply to: Andres Freund (#23)
#36Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#34)
#37Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#32)
#38Greg Sabino Mullane
greg@turnstep.com
In reply to: Robert Treat (#35)
#39Greg Sabino Mullane
greg@turnstep.com
In reply to: Greg Sabino Mullane (#38)
#40Andres Freund
andres@anarazel.de
In reply to: Greg Sabino Mullane (#39)
#41Peter Geoghegan
In reply to: Andres Freund (#40)
#42Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Peter Geoghegan (#41)
#43David Rowley
dgrowleyml@gmail.com
In reply to: Greg Sabino Mullane (#38)
#44Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Rowley (#43)
#45Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#42)
#46Ants Aasma
ants.aasma@cybertec.at
In reply to: Tomas Vondra (#16)
#47Robert Haas
robertmhaas@gmail.com
In reply to: Greg Sabino Mullane (#38)
#48Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Ants Aasma (#46)
#49Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#47)
#50Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#49)
#51Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#50)
#52Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#44)
#53Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#52)
#54Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#50)
#55Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#54)
#56Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#55)
#57Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#56)
#58Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#57)
#59Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#22)
#60Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#59)
#61Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#56)
#62Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#61)
#63Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#62)
#64Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#63)
#65Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#48)
#66Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#60)
#67Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#66)
#68Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Bruce Momjian (#67)