Should we update the random_page_cost default value?

Started by Tomas Vondra · 5 months ago · 68 messages
#1 Tomas Vondra
tomas.vondra@2ndquadrant.com

Hi,

I wonder if it's time to consider updating the random_page_cost default
value. There are multiple reasons to maybe do that.

The GUC (and the 4.0 default) was introduced in ~2000 [1], so ~25 years
ago. During that time the world went from rotational drives through
multiple generations of flash / network-attached storage. I find it hard
to believe those changes wouldn't affect random_page_cost.

And indeed, it's common advice to reduce the GUC closer to 1.0 on
SSDs. I myself recommended doing that in the past, but over time I've
grown somewhat skeptical about it. The advice is based on the "obvious"
wisdom that SSDs are much better at handling random I/O than rotational
disks.

But this has two flaws. First, it assumes the current 4.0 default makes
sense - maybe it doesn't, and then it's useless as a "starting point".
Second, it's not obvious how much better SSDs are without concurrent
I/Os (which are needed to fully leverage an SSD). We don't issue
concurrent I/Os for index scans (yet), and even if we did, the cost
model has no concept of such concurrency.

Recently, I've been doing some experiments evaluating how often we pick
an optimal scan for simple select queries, assuming accurate estimates.
Turns out we pick the wrong plan fairly often, even with almost perfect
estimates. I somewhat expected that, with the default random_page_cost
value. What did surprise me was that to improve the plans, I had to
*increase* the value, even on really new/fast SSDs ...

I started looking at how we calculated the 4.0 default back in 2000.
Unfortunately, there's not a lot of info, as Tom pointed out in 2024 [2].
But he outlined how the experiment worked:

- generate large table (much bigger than RAM)
- measure runtime of seq scan
- measure runtime of full-table index scan
- calculate how much more expensive a random page access is

So I decided to try doing this on a couple of different devices, and see
what random_page_cost value that gives me. Attached is a script doing
such a benchmark:

(1) initializes a new cluster, with a couple parameters adjusted

(2) creates a random table (with uniform distribution)

(3) runs a sequential scan

SELECT * FROM (SELECT * FROM test_table) OFFSET $nrows;

(4) runs an index scan

SELECT * FROM (SELECT * FROM test_table ORDER BY id) OFFSET $nrows;

The script does a couple things to force the query plan, and it reports
timings of the two queries at the end.
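The plan forcing can be sketched as follows. This is a hypothetical reconstruction, not the attached run.sh (which is authoritative); the `benchmark_queries` helper and the specific enable_* settings are my assumptions:

```python
# Hypothetical sketch of the two timed queries. Disabling the competing
# scan types via enable_* GUCs forces each plan shape; OFFSET $nrows makes
# the server read every row while returning nothing to the client.

def benchmark_queries(nrows: int) -> dict[str, str]:
    """Build the SQL for the forced seqscan and forced index scan."""
    return {
        "seqscan": (
            "SET enable_indexscan = off; SET enable_bitmapscan = off; "
            f"SELECT * FROM (SELECT * FROM test_table) OFFSET {nrows};"
        ),
        "indexscan": (
            "SET enable_seqscan = off; SET enable_bitmapscan = off; "
            f"SELECT * FROM (SELECT * FROM test_table ORDER BY id) OFFSET {nrows};"
        ),
    }

queries = benchmark_queries(500_000_000)
```

Each query would then be timed separately (e.g. with \timing in psql), which is all the later calculation needs.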

I've been running this on machines with 64GB of RAM, so I chose the
table to have 500M rows. With fillfactor=20 that means a ~182GB table
(23809524 pages).
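Those numbers are self-consistent; a quick check (my arithmetic, not part of the script):

```python
# Verify the quoted table geometry: 500M rows in 23809524 8kB pages works
# out to ~21 rows per page (due to fillfactor=20) and ~182GB on disk.
ROWS = 500_000_000
PAGES = 23_809_524
BLCKSZ = 8192  # default PostgreSQL block size in bytes

rows_per_page = ROWS / PAGES        # ~21.0
size_gb = PAGES * BLCKSZ / 2**30    # ~181.7
print(round(rows_per_page), round(size_gb))  # 21 182
```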

Let's say that

T(seq) = timing of the seqscan query
T(idx) = timing of the index scan
IOS(seq) = number of sequential page reads
IOS(idx) = number of random page reads
P = number of pages
N = number of rows

then the time to read a sequential page is (because IOS(seq) == P)

PT(seq) = T(seq) / IOS(seq) = T(seq) / P

and the time to read a random page is (assuming the table is perfectly
random, with no cache hits):

PT(idx) = T(idx) / IOS(idx) = T(idx) / N

which gives us the "ideal" random page cost (as a value relative to
reading a page sequentially)

random_page_cost = PT(idx) / PT(seq)

                     T(idx) * P
                 = --------------
                     T(seq) * N
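Plugging the formula into code (the function name is mine) and feeding it the single-NVMe timings from the table below reproduces the ~20 figure:

```python
# random_page_cost = PT(idx) / PT(seq) = (T(idx) * P) / (T(seq) * N),
# per the derivation above.

def random_page_cost(t_seq: float, t_idx: float, pages: int, rows: int) -> float:
    pt_seq = t_seq / pages  # seconds per sequentially-read page
    pt_idx = t_idx / rows   # seconds per randomly-read page (one page per row)
    return pt_idx / pt_seq

# Single NVMe: 98s seqscan, 42232s index scan, 23809524 pages, 500M rows.
print(round(random_page_cost(98, 42232, 23_809_524, 500_000_000), 1))  # 20.5
```

(The table reports 20.4, presumably computed from unrounded timings.)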

The "no cache hits" assumption is not quite correct - with 182GB, about
30% of the pages end up in memory. I didn't think of using
debug_io_direct at the time, but it doesn't affect the conclusion very
much (in fact, it would *increase* the random_page_cost value, which
makes it worse).

I did this on the three SSDs I use for testing - 4x NVMe RAID0, 4x SATA
RAID0, and a single NVMe drive. Here are the results:

                 seqscan (s)   index scan (s)   random_page_cost
   -----------------------------------------------------------------
   NVMe                  98           42232             20.4
   NVMe/RAID0            24           25462             49.3
   SATA/RAID0           109           48141             21.0

These are reasonably good SSDs, and yet the "correct" random_page_cost
comes out at about 5-10x our default.

FWIW I double-checked that the test is actually I/O bound. The CPU usage
never goes over ~50% (not even in the seqscan case).

These calculated values also align with the "optimal" plan choice, i.e.
the plans flip much closer to the actual crossing point (compared to
where it'd flip with 4.0).

It obviously contradicts the advice to set the value closer to 1.0. But
why is that? SSDs certainly are better at random I/O, even if the I/O is
not concurrent and the SSD is not fully utilized. So the 4.0 seems off -
for rotational disks the value should be even higher than what we got
for SSDs ...

I don't have any rotational devices in my test machines anymore, but I
got an Azure VM with a local "SCSI" disk and a "standard HDD" volume.
And I got this (this is with 10M rows, ~3.7GB, with direct I/O):

                 seqscan (s)   index scan (s)   random_page_cost
   -----------------------------------------------------------------
   SCSI                  2.1            1292             28.5
   standard HDD        209.8           27586             62.6

I suspect the SCSI disk is not actually rotational; it seems more like
an SSD behind a SCSI interface (whatever model that might be). The
"standard HDD" seems much closer to rotational, with ~370 IOPS (it'd
take ages to do the index scan on more than 10M rows).
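The ~370 IOPS estimate follows directly from the timings, since the index scan does roughly one random page read per row (my arithmetic):

```python
# "standard HDD" volume: 10M rows, index scan took 27586 seconds.
rows = 10_000_000
t_index_scan = 27_586  # seconds, from the table above

iops = rows / t_index_scan  # ~363 random reads/second - HDD territory
print(round(iops))
```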

Unless I made some silly mistake, these results suggest the current 4.0
value is a bit too low, and something like ~20 would be better even on
SSDs. This is not the first time a higher default was suggested - see
this 2008 post [3]. Of course, that's from before SSDs became a thing;
it's about the evolution of hard disks and our code.

However, it also says this:

Yeah, it seems like raising random_page_cost is not something we
ever recommend in practice. I suspect what we'd really need here to
make any progress is a more detailed cost model, not just fooling
with the parameters of the existing one.

I don't quite follow the reasoning. If increasing the value would
require making the cost model more detailed, why wouldn't the same
apply to lowering it? I don't see a reason for the asymmetry.

Also, I intentionally used a table with "perfectly random" data, because
that's about the simplest thing to estimate, and it indeed makes all the
estimates almost perfect (including the internal ones in cost_index). If
we can't cost such simple cases correctly, what's the point of costing?

From a robustness point of view, wouldn't it be better to actually err
on the side of using a higher random_page_cost value? That'd mean we
flip to "more-sequential" scans sooner, with much "flatter" behavior.
That doesn't need to be a seqscan (which is about as flat as it gets),
but e.g. a bitmap scan - which probably silently "fixes" many cases
where the index scan gets costed too low.

It also says this:

And the value of 4 seems to work well in practice.

I wonder how we know that. Most users don't experiment with different
values very much. They just run with the default, or maybe even lower
it, based on some recommendation. But they don't run the same query with
different values, so they can't spot differences unless they hit a
particularly bad plan.

Of course, it's also true most workloads tend to access well cached
data, which makes errors much cheaper. Or maybe just queries with the
"problematic selectivities" are not that common. Still, even if it
doesn't change the scan choice, it seems important to keep the cost
somewhat closer to reality because of the plan nodes above ...

It seems to me the current default is a bit too low, but changing a GUC
this important is not trivial. So what should we do about it?

regards

[1]: /messages/by-id/14601.949786166@sss.pgh.pa.us

[2]: /messages/by-id/3866858.1728961439@sss.pgh.pa.us

[3]: /messages/by-id/23625.1223642230@sss.pgh.pa.us

--
Tomas Vondra

Attachments:

run.sh (application/x-shellscript)
#2 wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Tomas Vondra (#1)
Re: Should we update the random_page_cost default value?

Hi Tomas

I really can't agree more. Many default values are just too conservative,
and the documentation doesn't provide best practices. I think we should
reduce it to 1.x, or add a tip to the documentation with recommended
values for different SSDs.

On Mon, 6 Oct 2025 at 08:59, Tomas Vondra <tomas@vondra.me> wrote:


#3 David Rowley
dgrowleyml@gmail.com
In reply to: wenhui qiu (#2)
Re: Should we update the random_page_cost default value?

On Mon, 6 Oct 2025 at 17:19, wenhui qiu <qiuwenhuifx@gmail.com> wrote:

I really can't agree more. Many default values are just too conservative, and the documentation doesn't provide best practices.,i think reduce to 1.x,Or add a tip in the document, providing a recommended value for different SSDs.

Did you read Tomas's email, or just the subject line? I think if you're
going to propose moving it in the opposite direction from what Tomas
found to be the more useful one, then that at least warrants providing
some evidence to the contrary of what Tomas has shown, or stating that
you think his methodology for his calculation is flawed because...

I suspect all you've done here is propagate the typical advice people
give out around here. It appears to me that Tomas went to great
lengths to not do that.

David

#4 Pavel Stehule
pavel.stehule@gmail.com
In reply to: David Rowley (#3)
Re: Should we update the random_page_cost default value?

On Mon, Oct 6, 2025 at 6:46, David Rowley <dgrowleyml@gmail.com> wrote:

On Mon, 6 Oct 2025 at 17:19, wenhui qiu <qiuwenhuifx@gmail.com> wrote:

I really can't agree more. Many default values are just too

conservative, and the documentation doesn't provide best practices.,i think
reduce to 1.x,Or add a tip in the document, providing a recommended value
for different SSDs.

Did you read Tomas's email or just the subject line? I think if
you're going to propose to move it in the opposite direction as to
what Tomas found to be the more useful direction, then that at least
warrants providing some evidence to the contrary of what Tomas has
shown or stating that you think his methodology for his calculation is
flawed because...

I suspect all you've done here is propagate the typical advice people
give out around here. It appears to me that Tomas went to great
lengths to not do that.

+1

The problem will be estimating the effect of the cache. It can span a
pretty wide range.

I have access to a not-too-small e-shop in the Czech Republic (though it
is not extra big). It uses today's classic stack - Java (ORM), Elastic,
Postgres. The database size is ca. 1.9TB, shared buffers are 32GB (it
handles about 10-20K logged-in users at a time).

The buffer cache hit ratio is 98.42%. The code is well optimized. This
ratio does not include the file system cache.

I believe that for different applications (OLAP), or for less
well-optimized code, the cache hit ratio can be much, much worse.

Last year I worked with customers that had Postgres in the cloud, and
common (not extra expensive) disks don't have great parameters today. It
is a question whether a single ratio like random_page_cost /
seq_page_cost can describe well the dynamic throttling (the dynamic
behavior of current cloud I/O) where customers frequently hit the limits.
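The cache-hit-ratio point can be made concrete with a toy blended-cost calculation (purely illustrative - this is not how the planner actually models caching; the function and the 0.01 "cached page" cost are my assumptions):

```python
# Toy model: with a high cache hit ratio, the *effective* cost of a
# "random" page access is a blend of a near-free cached read and the
# full random_page_cost miss penalty.

def effective_random_cost(rpc: float, hit_ratio: float,
                          cached_cost: float = 0.01) -> float:
    return hit_ratio * cached_cost + (1 - hit_ratio) * rpc

# At the 98.42% hit ratio mentioned above, even rpc = 20 blends way down:
print(round(effective_random_cost(20.0, 0.9842), 2))  # 0.33
```

Which may be part of why well-cached OLTP workloads rarely notice the raw value.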

Regards

Pavel


#5 David Rowley
dgrowleyml@gmail.com
In reply to: Tomas Vondra (#1)
Re: Should we update the random_page_cost default value?

On Mon, 6 Oct 2025 at 13:59, Tomas Vondra <tomas@vondra.me> wrote:

Unless I did some silly mistakes, these results suggest the current 4.0
value is a bit too low, and something like ~20 would be better even on
SSDs. This is not the first time it was suggested a higher default might
be better - see this 2008 post [3]. Of course, that's from before SSDs
became a thing, it's about evolution in hard disks and our code.

Thanks for going to all that effort to calculate that. It was an
interesting read, and also very interesting that you found the opposite
of the advice that people typically provide.

I don't have any HDDs around to run the script to check the results. I
do have a machine with 2 SSDs, one PCIe3 and one PCIe4. I'll see if I
can get you some results from that.

It would be interesting to see how your calculated values fare when
given a more realistic workload. Say TPC-H or Join Order Benchmark. I
recall that TPC-H has both a power and a throughput result, one to test
concurrency and one for single query throughput. I wonder if the same
random_page_cost setting would be preferred in both scenarios. I can
imagine that it might be more useful to have more index pages in
shared buffers when there's strong contention for buffers. It would be
interesting to run some pg_buffercache queries with GROUP BY relkind
to see how much of an effect changing random_page_cost has on the
number of buffers per relkind after each query.

I wonder if the OtterTune people collected any "phone-home"
information feeding back about what the software picked for GUCs. It
would be interesting to know if there was some trend to show what the
best random_page_cost setting was or if the best setting varied based
on the server and workload.

David

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#1)
Re: Should we update the random_page_cost default value?

Tomas Vondra <tomas@vondra.me> writes:

I wonder if it's time to consider updating the random_page_cost default
value. There are multiple reasons to maybe do that.

Re-reading the old links you provided, I was reminded that the 4.0
value came in with some fairly fundamental changes to the cost model,
such as accounting for WHERE-clause evaluation explicitly instead of
assuming that it was down in the noise. I can't help wondering if
it's time for another rethink of what the cost model is, rather than
just messing with its coefficients. I don't have any concrete ideas
about what that should look like, though.

Another angle is that I expect that the ongoing AIO work will largely
destroy the existing model altogether, at least if you think in terms
of the model as trying to predict query execution time. But if what
we're trying to model is net resource demands, with an eye to
minimizing the total system load not execution time of any one query,
maybe we can continue to work with something close to what we've
traditionally done.

No answers here, just more questions ...

regards, tom lane

#7 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Tomas Vondra (#1)
Re: Should we update the random_page_cost default value?

On Mon, 2025-10-06 at 02:59 +0200, Tomas Vondra wrote:

I wonder if it's time to consider updating the random_page_cost default
value. There are multiple reasons to maybe do that.

[experiments]

Unless I did some silly mistakes, these results suggest the current 4.0
value is a bit too low, and something like ~20 would be better even on
SSDs.

[...]

It seems to me the current default is a bit too low, but changing a GUC
this important is not trivial. So what should we do about it?

I have no reason to doubt your experiments and your reasoning.

However, my practical experience is that PostgreSQL tends to favor
sequential scans too much. Often that happens together with a parallel
plan: I find PostgreSQL with the default configuration preferring a plan
with two parallel workers performing a sequential scan with a
ridiculously selective (correctly estimated!) filter condition, like 500
rows out of a million, over an index scan that is demonstrably faster.

I have no artificial reproducer for that, and I admit that I didn't hunt
down the reason why the planner might prefer such a plan. I just tell
people to lower random_page_cost, and the problem goes away - so I am
clearly fighting symptoms. Often, an alternative solution is to set
max_parallel_workers_per_gather to 0, which seems to suggest that perhaps
the suggestion from [1] is more interesting than I thought.

Anyway, I cannot remember ever having been in a situation where
PostgreSQL prefers a slow index scan and I had to raise random_page_cost
to get a faster sequential scan. Perhaps that is because slow index
scans are often not drastically slower, or perhaps I deal too little
with purely analytical queries.

Again, this doesn't invalidate any of what you said. I just wanted to
share my experiences.

Yours,
Laurenz Albe

[1]: /messages/by-id/a5916f83-de79-4a40-933a-fb0d9ba2f5a0@app.fastmail.com

#8 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Tom Lane (#6)
Re: Should we update the random_page_cost default value?

On Mon, 2025-10-06 at 01:29 -0400, Tom Lane wrote:

But if what
we're trying to model is net resource demands, with an eye to
minimizing the total system load not execution time of any one query,
maybe we can continue to work with something close to what we've
traditionally done.

Did anybody propose that?
I was under the impression that PostgreSQL's cost model tries to
estimate and optimize execution time, not resource consumption.
Changing that would be pretty radical. For example, it would be
quite obvious that we'd have to disable parallel query by default.

But perhaps I misunderstood, or perhaps I am just too conservative.

Yours,
Laurenz Albe

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Laurenz Albe (#8)
Re: Should we update the random_page_cost default value?

Laurenz Albe <laurenz.albe@cybertec.at> writes:

On Mon, 2025-10-06 at 01:29 -0400, Tom Lane wrote:

But if what
we're trying to model is net resource demands, with an eye to
minimizing the total system load not execution time of any one query,
maybe we can continue to work with something close to what we've
traditionally done.

Did anybody propose that?

I just did ;-). If we don't adopt a mindset along that line,
then AIO is going to require some *radical* changes in the
planner's I/O cost models.

I was under the impression that PostgreSQL's cost model tries to
estimate and optimize execution time, not resource consumption.

Yup, that's our traditional view of it. But I wonder how we
will make such estimates in a parallel-I/O world, especially
if we don't try to account for concurrent query activity.
(Which is a place I don't want to go, because it would render
planning results utterly irreproducible.)

But perhaps I misunderstood, or perhaps I am just too conservative.

I'm normally pretty conservative also about changing planner
behavior. But in this context I think we need to be wary of
thinking too small. The fact that people keep coming out with
different ideas of what random_page_cost needs to be suggests
that there's something fundamentally wrong with the concept.

regards, tom lane

#10 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Pavel Stehule (#4)
Re: Should we update the random_page_cost default value?

On 10/6/25 07:25, Pavel Stehule wrote:

On Mon, Oct 6, 2025 at 6:46, David Rowley <dgrowleyml@gmail.com> wrote:

On Mon, 6 Oct 2025 at 17:19, wenhui qiu <qiuwenhuifx@gmail.com> wrote:

I really can't agree more. Many default values are just too

conservative, and the documentation doesn't provide best
practices.,i think reduce to 1.x,Or add a tip in the document,
providing a recommended value for different SSDs.

Did you read Tomas's email or just the subject line?  I think if
you're going to propose to move it in the opposite direction as to
what Tomas found to be the more useful direction, then that at least
warrants providing some evidence to the contrary of what Tomas has
shown or stating that you think his methodology for his calculation is
flawed because...

I suspect all you've done here is propagate the typical advice people
give out around here.  It appears to me that Tomas went to great
lengths to not do that.

+1

The problem will be in estimation of the effect of cache. It can be
pretty wide range.

I have a access to not too small eshop in Czech Republic (but it is not
extra big) - It uses today classic stack - Java (ORM), Elastic,
Postgres. The database size is cca 1.9T, shared buffers are 32GB (it
handles about 10-20K logged users at one time). 

The buffer cache hit ratio is 98.42%. The code is well optimized. This
ratio is not calculated with file system cache.

I believe so for different applications  (OLAP) or less well optimized
the cache hit ratio can be much much worse.

Last year I had an experience with customers that had Postgres in
clouds, and common (not extra expensive) discs are not great parameters
today. It is a question if one ratio like random page cost / seq page
cost can well describe dynamic throttling (or dynamic behavior of
current clouds io) where customers frequently touch limits.

Perhaps. Estimating cache effects is hard, no argument about that.

The estimation works by assuming no cache, and then adjusting that based
on some rough approximation of cache effects. If we can't get the first
step sufficiently close to reality, there's no chance of getting a good
final estimate.

The test queries were intentionally constructed (data size, randomness)
to make caching mostly irrelevant - both for estimation and execution.

regards

--
Tomas Vondra

#11 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Rowley (#5)
Re: Should we update the random_page_cost default value?

On 10/6/25 07:26, David Rowley wrote:

On Mon, 6 Oct 2025 at 13:59, Tomas Vondra <tomas@vondra.me> wrote:

Unless I did some silly mistakes, these results suggest the current 4.0
value is a bit too low, and something like ~20 would be better even on
SSDs. This is not the first time it was suggested a higher default might
be better - see this 2008 post [3]. Of course, that's from before SSDs
became a thing, it's about evolution in hard disks and our code.

Thanks for going to all that effort to calculate that. It was an
interesting read and also very interesting that you found the opposite
to the typical advice that people typically provide.

I don't have any HDDs around to run the script to check the results. I
do have a machine with 2 SSDs, one PCIe3 and one PCIe4. I'll see if I
can get you some results from that.

Yeah, that'd be interesting. I don't think it'll be massively different
from the data I collected, but more data is good.

FWIW I suggest modifying the script to use "debug_io_direct = data" and
smaller shared_buffers. That allows using much smaller data sets (and
thus faster runs).

It would be interesting to see how your calculated values fare when
given a more realistic workload. Say TPC-H or Join Order Benchmark. I
recall that TPCH has both a power and a throughput result, one to test
concurrency and one for single query throughput. I wonder if the same
random_page_cost setting would be preferred in both scenarios. I can
imagine that it might be more useful to have more index pages in
shared buffers when there's strong contention for buffers. It would be
interesting to run some pg_buffercache queries with GROUP BY relkind
to see how much of an effect changing random_page_cost has on the
number of buffers per relkind after each query.

Good idea, I'll give TPC-H a try soon. My concern is that for complex
queries it's much harder to pinpoint the problem, and an estimation
error may sometimes compensate for (or amplify) an earlier one. Worth a try.

As for the concurrency, I don't have a great answer. But perhaps it's
related to Tom's point about AIO. I mean, AIO also turns serial IOs to
concurrent ones, so maybe it's similar to multiple concurrent queries?

I wonder if the OtterTune people collected any "phone-home"
information feeding back about what the software picked for GUCs. It
would be interesting to know if there was some trend to show what the
best random_page_cost setting was or if the best setting varied based
on the server and workload.

No idea, and given OT is gone I doubt we'd get any data.

regards

--
Tomas Vondra

#12 Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Tom Lane (#9)
Re: Should we update the random_page_cost default value?

On Mon, 2025-10-06 at 01:53 -0400, Tom Lane wrote:

Laurenz Albe <laurenz.albe@cybertec.at> writes:

On Mon, 2025-10-06 at 01:29 -0400, Tom Lane wrote:

But if what
we're trying to model is net resource demands, with an eye to
minimizing the total system load not execution time of any one query,
maybe we can continue to work with something close to what we've
traditionally done.

Did anybody propose that?

I just did ;-).  If we don't adopt a mindset along that line,
then AIO is going to require some *radical* changes in the
planner's I/O cost models.

I see your point, and actually the idea of the planner targeting
the lowest resource usage is quite attractive. That is, in a
situation where you want to optimize for throughput. I regularly
find myself advising users that if their CPU load is approaching
100%, they had better disable parallel query.

But I am afraid that that would pessimize plans for analytical
queries, where your sole goal is a low response time.

This is far from a serious proposal, but perhaps there could be
a parameter "optimizer_goal" with values "throughput", "response_time"
and "mixed" that determines the default value for other parameters...

Yours,
Laurenz Albe

#13Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Laurenz Albe (#7)
Re: Should we update the random_page_cost default value?

On 10/6/25 07:34, Laurenz Albe wrote:

On Mon, 2025-10-06 at 02:59 +0200, Tomas Vondra wrote:

I wonder if it's time to consider updating the random_page_cost default
value. There are multiple reasons to maybe do that.

[experiments]

Unless I made some silly mistakes, these results suggest the current 4.0
value is a bit too low, and something like ~20 would be better even on
SSDs.

[...]

It seems to me the current default is a bit too low, but changing a GUC
this important is not trivial. So what should we do about it?

I have no reason to doubt your experiments and your reasoning.

However, my practical experience is that PostgreSQL tends to favor
sequential scans too much. Often, that happens together with a parallel
plan, and I find that PostgreSQL with the default configuration prefers
a plan with two parallel workers performing a sequential scan with a
ridiculously selective (correctly estimated!) filter condition, like 500
rows out of a million, over an index scan that is demonstrably faster.

I have no artificial reproducer for that, and I admit that I didn't hunt
down the reason why the planner might prefer such a plan. I just tell
people to lower random_page_cost, and the problem goes away. So I am
clearly fighting symptoms. Often, an alternative solution is to set
max_parallel_workers_per_gather to 0, which seems to suggest that perhaps
the suggestion from [1] is more interesting than I thought.

I don't doubt your experience, but to draw any conclusions from this
report we'd need to understand why it happens. I suspect there's some
estimation issue (likely unrelated to random_page_cost), or some effect
our cost model does not / can't consider. And lowering random_page_cost
simply compensates for the earlier error.

Anyway, I cannot remember ever having been in a situation where PostgreSQL
prefers a slow index scan, and I had to raise random_page_cost to get a
faster sequential scan. Perhaps that is because slow index scans are
often not drastically slower, or perhaps I deal too little with purely
analytical queries.

I mentioned a couple factors mitigating the worst effects at the end of
my initial message.

The "range" where we cost seqscan and indexscan the wrong way is not
that wide. In the NVMe RAID0 case we should flip at 0.1% selectivity,
but we flip (based on cost) at ~1.5%, at which point the index scan is
10x more expensive than the seq scan.

Maybe people don't run queries in this range too often? On the 500M-row
table it's ~5M rows, which for OLTP seems like a lot and for OLAP is
probably too little. Real data is likely more correlated (and thus
benefits from caching).

And of course, we don't just flip between seqscan/indexscan, we also
have bitmapscan, and that behaves much more reasonably.
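For intuition, the break-even selectivity in a toy version of this model
(grossly simplified, ignoring CPU costs, index pages, correlation and
caching; the helper name and table sizes are mine, purely illustrative):

```python
def breakeven_selectivity(table_pages, table_rows,
                          seq_page_cost=1.0, random_page_cost=4.0):
    # Crude model: a seq scan reads every page once; an uncorrelated
    # index scan pays one random page fetch per matching row. The
    # break-even point is where the two costs are equal.
    seq_cost = table_pages * seq_page_cost
    return seq_cost / (table_rows * random_page_cost)

# e.g. 500M rows at ~60 rows/page -> ~8.3M pages:
print(breakeven_selectivity(8_300_000, 500_000_000))  # ~0.004, i.e. ~0.4%
```

With these (made-up) sizes the crossover sits below 1%, which is why
even small shifts in random_page_cost move the flip point noticeably.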

regards

--
Tomas Vondra

#14Michael Banck
michael.banck@credativ.de
In reply to: Tomas Vondra (#1)
Re: Should we update the random_page_cost default value?

Hi,

On Mon, Oct 06, 2025 at 02:59:16AM +0200, Tomas Vondra wrote:

I started looking at how we calculated the 4.0 default back in 2000.
Unfortunately, there's not a lot of info, as Tom pointed out in 2024 [2].
But he outlined how the experiment worked:

- generate large table (much bigger than RAM)
- measure runtime of seq scan
- measure runtime of full-table index scan
- calculate how much more expensive a random page access is
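The last step of the quoted procedure is just a ratio of per-page times.
As a sketch (the helper is hypothetical and all numbers are purely
illustrative, not measured):

```python
def estimate_random_page_cost(seq_seconds, seq_pages,
                              idx_seconds, idx_page_fetches):
    """Derive random_page_cost as the ratio of per-page access times.

    seq_pages: pages read by the sequential scan (the table size)
    idx_page_fetches: heap page fetches done by the index scan
    (close to one per row for a poorly correlated index).
    """
    seq_cost_per_page = seq_seconds / seq_pages
    rand_cost_per_page = idx_seconds / idx_page_fetches
    return rand_cost_per_page / seq_cost_per_page

# Illustrative only: a 100s seq scan over 1M pages vs a 4000s index
# scan doing 10M random heap fetches.
print(round(estimate_random_page_cost(100.0, 1_000_000,
                                      4000.0, 10_000_000), 2))  # prints 4.0
```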

Ok, but I also read somewhere (I think it might have been Bruce in a
recent (last few years) discussion of random_page_cost) that on top of
that, we assumed 90% (or was it 95%?) of the queries were cached in
shared_buffers (preferably the indexes), so that random access being
massively slower than sequential access (surely more than 4x back in
2000) is offset by that. I only quickly read your mail, but I didn't see
any discussion of caching at first glance, or do you think it does not
matter much?

Michael

#15Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Banck (#14)
Re: Should we update the random_page_cost default value?

On 10/6/25 11:02, Michael Banck wrote:

Hi,

On Mon, Oct 06, 2025 at 02:59:16AM +0200, Tomas Vondra wrote:

I started looking at how we calculated the 4.0 default back in 2000.
Unfortunately, there's not a lot of info, as Tom pointed out in 2024 [2].
But he outlined how the experiment worked:

- generate large table (much bigger than RAM)
- measure runtime of seq scan
- measure runtime of full-table index scan
- calculate how much more expensive a random page access is

Ok, but I also read somewhere (I think it might have been Bruce in a
recent (last few years) discussion of random_page_cost) that on top of
that, we assumed 90% (or was it 95%?) of the queries were cached in
shared_buffers (preferably the indexes), so that random access being
massively slower than sequential access (surely more than 4x back in
2000) is offset by that. I only quickly read your mail, but I didn't see
any discussion of caching at first glance, or do you think it does not
matter much?

I think you're referring to this:

/messages/by-id/1156772.1730397196@sss.pgh.pa.us

As Tom points out, that's not really how we calculated the 4.0 default.
We should probably remove that from the docs.

regards

--
Tomas Vondra

#16Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#6)
Re: Should we update the random_page_cost default value?

On 10/6/25 07:29, Tom Lane wrote:

Tomas Vondra <tomas@vondra.me> writes:

I wonder if it's time to consider updating the random_page_cost default
value. There are multiple reasons to maybe do that.

Re-reading the old links you provided, I was reminded that the 4.0
value came in with some fairly fundamental changes to the cost model,
such as accounting for WHERE-clause evaluation explicitly instead of
assuming that it was down in the noise. I can't help wondering if
it's time for another rethink of what the cost model is, rather than
just messing with its coefficients. I don't have any concrete ideas
about what that should look like, though.

True, maybe it's time for a larger revision. Do you have any thoughts on
how it should be changed? (I don't, but I didn't have the ambition.)

Another angle is that I expect that the ongoing AIO work will largely
destroy the existing model altogether, at least if you think in terms
of the model as trying to predict query execution time. But if what
we're trying to model is net resource demands, with an eye to
minimizing the total system load not execution time of any one query,
maybe we can continue to work with something close to what we've
traditionally done.

No answers here, just more questions ...

I had the same thought, when working on the (index) prefetching. Which
of course now relies on AIO. Without concurrency, there wasn't much
difference between optimizing for resources and time, but AIO changes
that. In fact, parallel query has a similar effect, because it also
spreads the work to multiple concurrent processes.

Parallel query simply divides the cost between workers, as if each used
a fraction of the resources. And the cost of the parallel plan is lower
than the sum of the per-worker costs. Maybe AIO should do something
similar? That is, estimate the I/O concurrency and lower the cost a bit?
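For reference, the way parallel costing divides work today looks roughly
like this. This is my recollection of get_parallel_divisor() in
costsize.c; treat the coefficients as approximate:

```python
def parallel_divisor(workers, leader_participates=True):
    # Each worker contributes a full share of work; the leader
    # contributes less and less as it spends more time servicing
    # workers, and nothing at all from 4 workers up.
    divisor = float(workers)
    if leader_participates:
        leader_contribution = 1.0 - 0.3 * workers
        if leader_contribution > 0:
            divisor += leader_contribution
    return divisor

# With 2 workers the leader still does 40% of a share, so the run
# cost is divided by 2.4 rather than 3.
print(parallel_divisor(2))  # prints 2.4
```

An AIO analogue could divide (part of) the IO cost by an estimated
number of in-flight IOs in a similar way.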

regards

--
Tomas Vondra

#17Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Tomas Vondra (#16)
Re: Should we update the random_page_cost default value?

On Mon, Oct 6, 2025 at 11:24 AM Tomas Vondra <tomas@vondra.me> wrote:

On 10/6/25 07:29, Tom Lane wrote:

Tomas Vondra <tomas@vondra.me> writes:

[..]

Another angle is that I expect that the ongoing AIO work will largely
destroy the existing model altogether, at least if you think in terms
of the model as trying to predict query execution time.

[..]

That is, estimate the I/O concurrency and lower the cost a bit?

Side question, out of curiosity: didn't that ship already sail with the
introduction of the streaming read API a while ago? After all,
io_combine_limit with vectored preadv() has the ability to greatly
accelerate seq scans (batching up to 16 syscalls while the kernel is
doing its own magic in the background anyway, with
merging/splitting/readahead). To me it looks like you are seeing heavy
concurrency benefits at least on that `NVMe/RAID0` testcase, so one
question would be: was the default random_page_cost also that inaccurate
earlier, e.g. on PG16?

-J.

#18Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#1)
Re: Should we update the random_page_cost default value?

Hi,

On 2025-10-06 02:59:16 +0200, Tomas Vondra wrote:

So I decided to try doing this on a couple different devices, and see
what random_page_cost values that gives me. Attached is a script doing
such benchmark:

(1) initializes a new cluster, with a couple parameters adjusted

(2) creates a random table (with uniform distribution)

(3) runs a sequential scan

SELECT * FROM (SELECT * FROM test_table) OFFSET $nrows;

(4) runs an index scan

SELECT * FROM (SELECT * FROM test_table ORDER BY id) OFFSET $nrows;

Why compare an unordered with an ordered scan? ISTM that if you want to
actually compare the cost of two different approaches to the same query, you'd
not want to actually change what the query does? Yes, you say that CPU time
never gets above 50%, but 50% isn't nothing.

It also seems that due to the ordering inside the table (the ORDER BY
random() during table creation), you're going to see vastly different
numbers of page accesses. While that's obviously something worth taking
into account for planning purposes, I don't think it'd properly be done
via random_page_cost itself.

I think doing this kind of measurement via normal SQL query processing
is almost always going to have too many other influences. I'd measure
using fio or such instead. It'd be interesting to see fio numbers for
your disks...

fio --directory /srv/fio --size=8GiB --name test --invalidate=0 --bs=$((8*1024)) --rw read --buffered 0 --time_based=1 --runtime=5 --ioengine pvsync --iodepth 1
vs --rw randread

gives me 51k/11k for sequential/rand on one SSD and 92k/8.7k for another.
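Taken at face value, those fio numbers already imply cost ratios well
above 4.0. Naive back-of-the-envelope arithmetic, ignoring caching,
concurrency and CPU costs (the helper name is mine):

```python
def implied_ratio(seq_iops, rand_iops):
    # With identical 8kB blocks, per-page cost is inversely
    # proportional to IOPS, so the IOPS ratio is the ratio of
    # per-page costs.
    return seq_iops / rand_iops

print(round(implied_ratio(51_000, 11_000), 1))  # prints 4.6
print(round(implied_ratio(92_000, 8_700), 1))   # prints 10.6
```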

It obviously contradicts the advice to set the value closer to 1.0. But
why is that? SSDs are certainly better with random I/O, even if the I/O
is not concurrent and the SSD is not fully utilized. So the 4.0 seems
off, the value should be higher than what we got for SSDs ...

I'd guess that the *vast* majority of PG workloads these days run on
networked block storage. For those, the actual latency at the storage
level is typically a rather small fraction of the overall IO latency,
which is instead dominated by network and other related costs (like the
indirection to decide which storage system to go to, and crossing
VM/host boundaries). Because the majority of the IO latency is
determined by network latency rather than storage latency, the random
IO/non-random IO difference plays less of a role.

From a robustness point of view, wouldn't it be better to actually err
on the side of using a higher random_page_cost value? That'd mean we
flip to "more-sequential" scans sooner, with much "flatter" behavior.
That doesn't need to be a seqscan (which is about as flat as it gets),
but e.g. a bitmap scan - which probably silently "fixes" many cases
where the index scan gets costed too low.

I think it's often the exact opposite - folks use a lower random page cost to
*prevent* the planner from going to sequential (or bitmap heap) scans. In many
real-world queries our selectivity estimates aren't great and the performance
penalties of switching from an index scan to a sequential scan are really
severe. As you note, this is heavily exascerbated by the hot data often being
cached, but cold data not. Obviously the seqscan will process the cold data
too.

Greetings,

Andres Freund

#19Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#9)
Re: Should we update the random_page_cost default value?

Hi,

On 2025-10-06 01:53:05 -0400, Tom Lane wrote:

Laurenz Albe <laurenz.albe@cybertec.at> writes:

On Mon, 2025-10-06 at 01:29 -0400, Tom Lane wrote:

But if what
we're trying to model is net resource demands, with an eye to
minimizing the total system load not execution time of any one query,
maybe we can continue to work with something close to what we've
traditionally done.

Did anybody propose that?

I just did ;-). If we don't adopt a mindset along that line,
then AIO is going to require some *radical* changes in the
planner's I/O cost models.

FWIW, I think some vaguely AIO-related cost concepts ought to have been
taken into account long ago, but weren't. I'm not trying to say that we
don't need to do work, but the asynchronicity-related issues are bigger
than explicit asynchronous IO.

E.g., despite there often being a >10x performance difference between a
forward and a backward index scan when the index and table order are
well correlated, we don't cost them any differently. The reason for that
is that the OS will do readahead for forward scans, but not for backward
scans.

Here's an example EXPLAIN ANALYZE on a disk with an artificial 1ms
latency (to simulate networked storage):

echo 3 |sudo tee /proc/sys/vm/drop_caches && psql -Xq -f ~/tmp/evict.sql && \
/usr/bin/time -f '%e' psql -Xq -c 'explain analyze SELECT * FROM aiobench ORDER BY serial_val ASC LIMIT 1 OFFSET 100000;'

QUERY PLAN >
-------------------------------------------------------------------------------------------------------------------------------------------------------------->
Limit (cost=4455.58..4455.62 rows=1 width=120) (actual time=49.820..49.821 rows=1.00 loops=1)
Buffers: shared hit=553 read=2145
I/O Timings: shared read=36.096
-> Index Scan using aiobench_serial_val_idx on aiobench (cost=0.57..8908138.12 rows=199957824 width=120) (actual time=2.971..47.261 rows=100001.00 loops=>
Index Searches: 1
Buffers: shared hit=553 read=2145
I/O Timings: shared read=36.096
Planning:
Buffers: shared hit=113 read=23
I/O Timings: shared read=10.618
Planning Time: 10.981 ms
Execution Time: 49.838 ms
(12 rows)

echo 3 |sudo tee /proc/sys/vm/drop_caches && psql -Xq -f ~/tmp/evict.sql && \
/usr/bin/time -f '%e' psql -Xq -c 'explain analyze SELECT * FROM aiobench ORDER BY serial_val DESC LIMIT 1 OFFSET 100000;'
QUERY PLAN >
-------------------------------------------------------------------------------------------------------------------------------------------------------------->
Limit (cost=4455.58..4455.62 rows=1 width=120) (actual time=2138.047..2138.049 rows=1.00 loops=1)
Buffers: shared hit=739 read=2138
I/O Timings: shared read=2123.844
-> Index Scan Backward using aiobench_serial_val_idx on aiobench (cost=0.57..8908138.12 rows=199957824 width=120) (actual time=4.912..2135.519 rows=10000>
Index Searches: 1
Buffers: shared hit=739 read=2138
I/O Timings: shared read=2123.844
Planning:
Buffers: shared hit=112 read=23
I/O Timings: shared read=10.446
Planning Time: 10.816 ms
Execution Time: 2138.067 ms
(12 rows)

36ms vs 2123ms in IO time, with the same cost.

Of course these queries *do* something different, but we often choose ordered
index scans as a way of implementing a query (e.g. as part of fast start
plans), where there are other alternatives of implementing the same query.

Another example of bad costing around this is that we do not take
effective_io_concurrency into account when planning bitmap heap scans,
despite it often making a *massive* difference in whether a bitmap heap
scan is a good choice or a bad one. E.g. on the system with the
simulated 1ms IO latency, the difference between 1 and 64 is vast.

eic=1:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=606243.51..606243.52 rows=1 width=8) (actual time=184218.775..184218.776 rows=1.00 loops=1)
Buffers: shared hit=2 read=194846
I/O Timings: shared read=183913.048
-> Bitmap Heap Scan on aiobench (cost=3900.18..605784.10 rows=183767 width=8) (actual time=79.811..184202.181 rows=199465.00 loops=1)
Recheck Cond: ((random_val >= '0.001'::double precision) AND (random_val <= '0.002'::double precision))
Heap Blocks: exact=194299
Buffers: shared hit=2 read=194846
I/O Timings: shared read=183913.048
-> Bitmap Index Scan on aiobench_random_val_idx (cost=0.00..3854.24 rows=183767 width=0) (actual time=37.287..37.288 rows=199465.00 loops=1)
Index Cond: ((random_val >= '0.001'::double precision) AND (random_val <= '0.002'::double precision))
Index Searches: 1
Buffers: shared hit=2 read=547
I/O Timings: shared read=4.972
Planning:
Buffers: shared hit=76 read=24
I/O Timings: shared read=11.571
Planning Time: 12.953 ms
Execution Time: 184218.986 ms

eic=64:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=606243.51..606243.52 rows=1 width=8) (actual time=2962.965..2962.966 rows=1.00 loops=1)
Buffers: shared hit=2 read=194846
I/O Timings: shared read=2316.070
-> Bitmap Heap Scan on aiobench (cost=3900.18..605784.10 rows=183767 width=8) (actual time=82.871..2947.892 rows=199465.00 loops=1)
Recheck Cond: ((random_val >= '0.001'::double precision) AND (random_val <= '0.002'::double precision))
Heap Blocks: exact=194299
Buffers: shared hit=2 read=194846
I/O Timings: shared read=2316.070
-> Bitmap Index Scan on aiobench_random_val_idx (cost=0.00..3854.24 rows=183767 width=0) (actual time=38.526..38.526 rows=199465.00 loops=1)
Index Cond: ((random_val >= '0.001'::double precision) AND (random_val <= '0.002'::double precision))
Index Searches: 1
Buffers: shared hit=2 read=547
I/O Timings: shared read=5.021
Planning:
Buffers: shared hit=76 read=24
I/O Timings: shared read=11.485
Planning Time: 12.890 ms
Execution Time: 2963.154 ms
(18 rows)

A ~60x query execution difference that's not at all represented in the cost
model. The former makes a bitmap heap scan a terrible choice, the latter a
good one...
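For what it's worth, those two runtimes match a trivial latency model
almost exactly (a sanity check using the read counts from the plans
above; the helper is just my notation):

```python
def io_seconds(reads, latency_s, concurrency):
    # If 'concurrency' IOs are kept in flight, total IO wait is
    # roughly reads * latency / concurrency.
    return reads * latency_s / concurrency

reads = 194846          # heap pages read by the bitmap heap scan above
latency = 0.001         # the simulated 1ms per IO

print(io_seconds(reads, latency, 1))    # ~195s, close to the observed 184s
print(io_seconds(reads, latency, 64))   # ~3.0s, close to the observed 2.96s
```

So nearly all of the 60x gap is explained by latency hiding, which the
cost model simply never sees.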

I was under the impression that PostgreSQL's cost model tries to
estimate and optimize execution time, not resource consumption.

Yup, that's our traditional view of it. But I wonder how we
will make such estimates in a parallel-I/O world, especially
if we don't try to account for concurrent query activity.
(Which is a place I don't want to go, because it would render
planning results utterly irreproducible.)

Another complicated aspect around this is that we don't take caching
into account in any way that reflects the real world. In common
workloads the hotly accessed data is all in shared_buffers, but the cold
portion of the data is not. In such scenarios switching e.g. from an
index scan to a sequential scan will be way worse than in a scenario
where all the data is cold, since it'll likely eat up a good portion of
the IO bandwidth and possibly pollute the buffer pool.

At the same time, leaving aside the difficulty of estimating it, making
plans depend on the current state of the buffer pool has some fairly
obvious, and massive, downsides. Like unstable plans, and never actually
ending up in the "high performance" situation after a restart, due to
choosing plans that just work for an empty buffer pool.

I do not have the *slightest* idea for how to improve the situation around
this, even though I think it's fairly important.

But perhaps I misunderstood, or perhaps I am just too conservative.

I'm normally pretty conservative also about changing planner
behavior. But in this context I think we need to be wary of
thinking too small. The fact that people keep coming out with
different ideas of what random_page_cost needs to be suggests
that there's something fundamentally wrong with the concept.

I think one of the big issues with random_page_cost is that it combines
two largely independent things:

1) the increased cost of doing a random IO over a sequential IO
2) that random IOs very often are synchronous and hard to predict

But we have had support for doing readahead for some random IO for a
long time (the posix_fadvise() calls within bitmap heap scans), just
without taking that into account from a costing POV.

I suspect that we'll continue to need to somehow distinguish between
random/non-random IO, the differences are simply too large at the storage
level to ignore.

But we need to add accounting for whether IO is synchronous or not. If
you have a bunch of random IOs, but you know them ahead of time cheaply
(say in a bitmap heap scan), there should be a different cost for the
query than if there are a bunch of random IOs that we cannot
realistically predict (say the IO for inner index pages in an ordered
index scan, or all accesses as part of an index nested loop where the
inner side is unique).

Unfortunately that will probably make it more problematic that we aren't
modeling resource consumption - costing a query that does 10x as many,
but prefetchable, IOs below an ever so slightly more expensive query is
probably not a tradeoff most want to make.
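To sketch the strawman: a cost function distinguishing the two
components might look something like this. Entirely hypothetical - the
function, its parameters and the coefficient choices are mine, not a
proposal:

```python
def page_io_cost(pages, random=True, predictable=False,
                 seq_page_cost=1.0, random_page_cost=4.0,
                 io_concurrency=16):
    """Hypothetical: separate 'random placement' from 'synchronous'.

    Predictable random IO (e.g. bitmap heap scan) can be issued
    ahead of time, so the synchronous-wait component is amortized
    over io_concurrency in-flight requests; unpredictable random
    IO pays the full price.
    """
    if not random:
        return pages * seq_page_cost
    per_page = random_page_cost
    if predictable:
        per_page = seq_page_cost + \
            (random_page_cost - seq_page_cost) / io_concurrency
    return pages * per_page

# Same 1000 random pages, very different costs depending on whether
# we could know them in advance:
print(page_io_cost(1000, predictable=False))  # prints 4000.0
print(page_io_cost(1000, predictable=True))   # prints 1187.5
```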

Greetings,

Andres Freund

#20Andres Freund
andres@anarazel.de
In reply to: Laurenz Albe (#7)
Re: Should we update the random_page_cost default value?

Hi,

On 2025-10-06 07:34:53 +0200, Laurenz Albe wrote:

However, my practical experience is that PostgreSQL tends to favor
sequential scans too much. Often, that happens together with a parallel
plan, and I find that PostgreSQL with the default configuration prefers
a plan with two parallel workers performing a sequential scan with a
ridiculously selective (correctly estimated!) filter condition, like 500
rows out of a million, over an index scan that is demonstrably faster.

I have no artificial reproducer for that, and I admit that I didn't hunt
down the reason why the planner might prefer such a plan. I just tell
people to lower random_page_cost, and the problem goes away. So I am
clearly fighting symptoms. Often, an alternative solution is to set
max_parallel_workers_per_gather to 0, which seems to suggest that perhaps
the suggestion from [1] is more interesting than I thought.

I've seen this quite often too. IIRC in the cases I've actually analyzed in
any depth it came down to a few root causes:

1) Fast-start plans (where the planner thought that a sequential scan
would find a matching tuple quickly, but it doesn't, leading to scanning
most of the table). I frankly think we should just disable these;
they're a very low-confidence bet with high costs in the case of a loss.

2) Not taking the likelihood of data already being cached into account leads
to preferring sequential scans due to seq_page_cost, even though the index
scan would not have required any IO

3) Our costing for the cost of predicate evaluation is extremely poor. Among
the reasons are

- There is no difference in cost between common operators, despite
significant differences in real evaluation cost. E.g. int and text
operators are not close in evaluation cost.

- IIRC we completely disregard the cost of potentially needing to
detoast, despite that very easily becoming the determining factor.

- Tuple deforming cost. It's a lot more CPU intensive to deform column
105 than column 5; often the index might avoid the more expensive
deforming, but we don't take that into account.

This often leads to under-estimating the CPU cost of seqscans.

Greetings,

Andres Freund

#21Bruce Momjian
bruce@momjian.us
In reply to: Tomas Vondra (#15)
#22Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#18)
#23Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#22)
#24Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#23)
#25Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#24)
#26Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#18)
#27Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#26)
#28Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#27)
#29Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#28)
#30Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#29)
#31Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#30)
#32Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#30)
#33Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Rowley (#5)
#34Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#31)
#35Robert Treat
xzilla@users.sourceforge.net
In reply to: Andres Freund (#23)
#36Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#34)
#37Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#32)
#38Greg Sabino Mullane
greg@turnstep.com
In reply to: Robert Treat (#35)
#39Greg Sabino Mullane
greg@turnstep.com
In reply to: Greg Sabino Mullane (#38)
#40Andres Freund
andres@anarazel.de
In reply to: Greg Sabino Mullane (#39)
#41Peter Geoghegan
In reply to: Andres Freund (#40)
#42Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Peter Geoghegan (#41)
#43David Rowley
dgrowleyml@gmail.com
In reply to: Greg Sabino Mullane (#38)
#44Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Rowley (#43)
#45Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#42)
#46Ants Aasma
ants.aasma@cybertec.at
In reply to: Tomas Vondra (#16)
#47Robert Haas
robertmhaas@gmail.com
In reply to: Greg Sabino Mullane (#38)
#48Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Ants Aasma (#46)
#49Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#47)
#50Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#49)
#51Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#50)
#52Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#44)
#53Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#52)
#54Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#50)
#55Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#54)
#56Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#55)
#57Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#56)
#58Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#57)
#59Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#22)
#60Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#59)
#61Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#56)
#62Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#61)
#63Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#62)
#64Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#63)
#65Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#48)
#66Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#60)
#67Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#66)
#68Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Bruce Momjian (#67)