cache estimates, cache access cost

Started by Cédric Villemain · 35 messages · pgsql-hackers
#1 Cédric Villemain
cedric.villemain.debian@gmail.com

Hello

Cache estimation and cache access costs are currently not accounted
for explicitly: they carry a cost, but there are no dedicated constants
(other than effective_cache_size, which has very limited usage).

Every I/O cost is built from seq_page_cost, random_page_cost and the
number of pages. In some places, formulas adjust the cost up or down
to take caching and data alignment into account.

There are:

* an estimate of the pages we will find in the PostgreSQL buffer cache
* an estimate of the pages we will find in the operating system page cache

and they can be computed for:

* the first access
* subsequent accesses

We currently make no distinction between the two cache areas (there
are more cache areas, but we don't care about them here), and we
'prefer' to estimate subsequent accesses rather than the first one.

There is also a point about cost estimates being abrupt: for example,
once a sort exceeds work_mem, its cost jumps because page accesses are
then accounted for.

The current cost estimates are already very good: most of our queries
run well without those famous 'HINTs', and the planner provides the
best plan in most cases.

But I believe we now need more tools to improve cost estimation even
further. I would like to propose some ideas, not all of them mine; the
topic has been in the air for a long time, and probably everything has
already been said (at least around a beer or a Pepsi).

Adding a new GUC "cache_page_cost":
- allows costing a page access when the page is estimated to be in
cache (see the sketch below)
- allows costing a sort that exceeds work_mem but should not hit disk
- allows random_page_cost to be used for what it is really meant for.
(I was tempted by a "write_page_cost" GUC, but I am unsure about that
one at this stage.)
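
As a rough illustration (not part of the proposal itself), here is a
minimal sketch of how such a GUC could blend into an I/O cost, assuming
a per-relation cached fraction is available; all names are hypothetical:

    /*
     * Sketch only: blended cost of reading npages sequentially when a
     * fraction "cached" of them is estimated to be resident.
     * cache_page_cost is the proposed GUC; seq_page_cost exists today.
     */
    static double
    blended_seq_cost(double npages, double cached,
                     double seq_page_cost, double cache_page_cost)
    {
        return npages * (cached * cache_page_cost +
                         (1.0 - cached) * seq_page_cost);
    }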

Adding 2 columns to pg_class, "oscache_percent" and "pgcache_percent"
(or similar names): they allow storing statistics about the percentage
of a relation present in each cache.
- The usage would be to estimate the cost of the first access to pages,
then use the Mackert and Lohman formula for subsequent accesses. The
latter only provides a way to estimate the cost of re-reading (see the
sketch below).

It is hard to advocate this with a concrete expected performance gain,
other than: we will have more options for more precise planner
decisions, and we may reduce the number of bad-plan reports. (Improving
cache estimation is also on the TODO list.)

--

I've already hacked the core a bit for this and added the 2 new columns,
with hooks to update them. ANALYZE OSCACHE updates one of them, and a
plugin can be used to provide the estimate (so how it gets filled is
not important; most OSes have ways to estimate it accurately, if anyone
wonders).
It is as-is for a POC, probably not clean enough to go to the
commitfest, and not expected to go there before some consensus is reached.
http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache
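
For anyone wondering how such an estimate is obtained: the usual POSIX
route, which pgfincore takes, is mmap() plus mincore(). A minimal,
Linux-flavored sketch with all error handling elided:

    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Fraction of a file's pages resident in the OS page cache. */
    static double
    cached_fraction(const char *path)
    {
        int fd = open(path, O_RDONLY);
        struct stat st;
        long pagesize = sysconf(_SC_PAGESIZE);

        fstat(fd, &st);
        size_t npages = (st.st_size + pagesize - 1) / pagesize;
        void *map = mmap(NULL, st.st_size, PROT_NONE, MAP_SHARED, fd, 0);
        unsigned char *vec = malloc(npages);

        mincore(map, st.st_size, vec);

        size_t resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;    /* low bit set => page resident */

        munmap(map, st.st_size);
        close(fd);
        free(vec);
        return npages > 0 ? (double) resident / npages : 0.0;
    }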

--

Hacking costsize.c is ... dangerous, I would say. It is easy to break
something that already works so well, and changing only one cost
function is not enough to keep a good balance.
The performance farm should help here ... and so should the full 9.2
cycle.

Comments?
--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

#2 Greg Smith
gsmith@gregsmith.com
In reply to: Cédric Villemain (#1)
Re: cache estimates, cache access cost

Cédric Villemain wrote:

http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache

This rebases easily to make Cedric's changes move to the end; I just
pushed a version with that change to
https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone
wants a cleaner one to browse. I've attached a patch too if that's more
your thing.

I'd recommend not getting too stuck on the particular hook Cédric has
added here to compute the cache estimate, which uses mmap and mincore to
figure it out. It's possible to compute similar numbers, albeit less
accurate, using an approach similar to how pg_buffercache inspects
things. And I even once wrote a background writer extension that
collected this sort of data as it was running the LRU scan anyway.
Discussions of this idea seem to focus on how the "what's in the cache?"
data is collected, which as far as I'm concerned is the least important
part. There are multiple options, some work better than others, and
there's no reason that can't be swapped out later. The more important
question is how to store the data collected and then use it for
optimizing queries.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

Attachments:

analyze_cache-v1.patch (text/x-patch, +508/-35)
#3 Robert Haas
robertmhaas@gmail.com
In reply to: Greg Smith (#2)
Re: cache estimates, cache access cost

On Sun, May 15, 2011 at 11:52 PM, Greg Smith <greg@2ndquadrant.com> wrote:

Cédric Villemain wrote:

http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache

This rebases easily to make Cedric's changes move to the end; I just pushed
a version with that change to
https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone
wants a cleaner one to browse.  I've attached a patch too if that's more
your thing.

Thank you. I don't much like sucking in other people's git repos - it
tends to take a lot longer than just opening a patch file, and if I
add the repo as a remote then my git repo ends up bloated. :-(

The more important question is how to store the data collected and
then use it for optimizing queries.

Agreed, but unless I'm missing something, this patch does nothing
about that. I think the first step needs to be to update all the
formulas that are based on random_page_cost and seq_page_cost to
properly take cache_page_cost into account - and in some cases it may
be a bit debatable what the right mathematics are.

For what it's worth, I don't believe for a minute that an analyze
process that may only run on a given table every six months has a
chance of producing useful statistics about the likelihood that a
table will be cached. The buffer cache can turn over completely in
under a minute, and a minute is a lot less than a month. Now, if we
measured this information periodically for a long period of time and
averaged it, that might be a believable basis for setting an optimizer
parameter. But I think we should take the approach recently discussed
on performance: allow it to be manually set by the administrator on a
per-relation basis, with some reasonable default (maybe based on the
size of the relation relative to effective_cache_size) if the
administrator doesn't intervene. I don't want to be excessively
negative about the approach of examining the actual behavior of the
system and using that to guide system behavior - indeed, I think there
are quite a few places where we would do well to incorporate that
approach to a greater degree than we do currently. But I think that
it's going to take a lot of research, and a lot of work, and a lot of
performance testing, to convince ourselves that we've come up with an
appropriate feedback mechanism that will actually deliver better
performance across a large variety of workloads. It would be much
better, IMHO, to *first* get a cached_page_cost parameter added, even
if the mechanism by which caching percentages are set is initially
quite crude - that will give us a clear-cut benefit that people can
begin enjoying immediately.
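
(As a rough sketch of the kind of size-based default described above;
this helper is hypothetical, no such mechanism is specified here:)

    #include <math.h>

    /*
     * Hypothetical default for the cached fraction of a relation when
     * the administrator has not set one: assume the relation competes
     * for a cache of effective_cache_size pages.
     */
    static double
    default_cached_fraction(double relpages, double effective_cache_size)
    {
        if (relpages <= 0.0)
            return 0.0;
        return fmin(1.0, effective_cache_size / relpages);
    }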

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4 Cédric Villemain
cedric.villemain.debian@gmail.com
In reply to: Robert Haas (#3)
Re: cache estimates, cache access cost

2011/5/17 Robert Haas <robertmhaas@gmail.com>:

On Sun, May 15, 2011 at 11:52 PM, Greg Smith <greg@2ndquadrant.com> wrote:

Cédric Villemain wrote:

http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache

This rebases easily to make Cedric's changes move to the end; I just pushed
a version with that change to
https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone
wants a cleaner one to browse.  I've attached a patch too if that's more
your thing.

Thank you.  I don't much like sucking in other people's git repos - it
tends to take a lot longer than just opening a patch file, and if I
add the repo as a remote then my git repo ends up bloated.  :-(

The more important question is how to store the data collected and
then use it for optimizing queries.

Agreed, but unless I'm missing something, this patch does nothing
about that.  I think the first step needs to be to update all the
formulas that are based on random_page_cost and seq_page_cost to
properly take cache_page_cost into account - and in some cases it may
be a bit debatable what the right mathematics are.

Yes, I provided the branch only in case someone wants to hack on
costsize.c, and to close the problem of getting the stats.

For what it's worth, I don't believe for a minute that an analyze
process that may only run on a given table every six months has a
chance of producing useful statistics about the likelihood that a
table will be cached.  The buffer cache can turn over completely in
under a minute, and a minute is a lot less than a month.  Now, if we
measured this information periodically for a long period of time and
averaged it, that might be a believable basis for setting an optimizer

The point is to get the in-cache ratio, not the distribution of the
data in cache (pgfincore also lets you see that information).
I don't see how a stable system (a server in production) can have its
ratio moving up and down so fast without a known pattern.
Maybe it is a data warehouse, where data moves a lot; then just update
your per-relation stats before starting your queries, as suggested in
other threads. Maybe it is just a matter of the frequency of stats
updates, or of an explicit request like we are *used to* doing
(ANALYZE foo;), to handle those situations.

parameter.  But I think we should take the approach recently discussed
on performance: allow it to be manually set by the administrator on a
per-relation basis, with some reasonable default (maybe based on the
size of the relation relative to effective_cache_size) if the
administrator doesn't intervene.  I don't want to be excessively
negative about the approach of examining the actual behavior of the
system and using that to guide system behavior - indeed, I think there
are quite a few places where we would do well to incorporate that
approach to a greater degree than we do currently.  But I think that
it's going to take a lot of research, and a lot of work, and a lot of
performance testing, to convince ourselves that we've come up with an
appropriate feedback mechanism that will actually deliver better
performance across a large variety of workloads.  It would be much
better, IMHO, to *first* get a cached_page_cost parameter added, even
if the mechanism by which caching percentages are set is initially
quite crude - that will give us a clear-cut benefit that people can
begin enjoying immediately.

The plugin I provided is just there to enable a first analysis of how
the OS cache contents move. You can either use pgfincore to monitor
that per table, or use the patch and monitor the values of the *cache
columns.

I took the hooks approach because it lets you do what you want :)
You can set up a hook where you set the values you want to see; this
allows, for example, fixing cold-start values, or permanent values set
by the DBA, or ... whatever you want here.
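
A minimal sketch of what such a hook and plugin could look like (the
hook name and signature here are hypothetical, echoing the patch set
rather than quoting it):

    /*
     * Hypothetical hook: a plugin returns the fraction of the relation
     * with the given OID estimated to be resident in the OS cache.
     */
    typedef double (*oscache_hook_type) (unsigned int relid);
    extern oscache_hook_type oscache_hook;

    /* Example plugin behavior: a fixed, DBA-chosen value. */
    static double
    fixed_oscache(unsigned int relid)
    {
        return 0.95;    /* "this table is 95% in cache", says the DBA */
    }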

The topic is: do we need more parameters to increase the value of our
planner?
1/ cache_page_cost
2/ cache information, arbitrarily set or not.

Starting with 1/ is OK for me; I would prefer to try both at once, if
possible, to avoid the pain of hacking costsize.c twice.

Several items remain to be discussed after that: formulas to handle
'small' tables, use of the data distribution (this touches the old
topic of auto-partitioning, while we are at it), cold state, hot
state, ...

PS: there is a very good blocker for the pg_class changes: what happens
on a standby? Maybe it just opens the door to unlocking that, or to
finding another option for keeping the information per table but
distinct per server. (Or we don't care, at least for a first
implementation, as with other parameters.)
--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

#5 Robert Haas
robertmhaas@gmail.com
In reply to: Cédric Villemain (#4)
Re: cache estimates, cache access cost

On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:

The point is to get the in-cache ratio, not the distribution of the
data in cache (pgfincore also lets you see that information).
I don't see how a stable system (a server in production) can have its
ratio moving up and down so fast without a known pattern.

Really? It doesn't seem that hard to me. For example, your nightly
reports might use a different set of tables than are active during the
day....

PS: there is a very good blocker for the pg_class changes: what happens
on a standby? Maybe it just opens the door to unlocking that, or to
finding another option for keeping the information per table but
distinct per server. (Or we don't care, at least for a first
implementation, as with other parameters.)

That's a good point, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6 Cédric Villemain
cedric.villemain.debian@gmail.com
In reply to: Robert Haas (#5)
Re: cache estimates, cache access cost

2011/5/19 Robert Haas <robertmhaas@gmail.com>:

On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:

The point is to get the in-cache ratio, not the distribution of the
data in cache (pgfincore also lets you see that information).
I don't see how a stable system (a server in production) can have its
ratio moving up and down so fast without a known pattern.

Really?  It doesn't seem that hard to me.  For example, your nightly
reports might use a different set of tables than are active during the
day....

Yes, this is a known pattern; I believe we can work with it.

PS: there is a very good blocker for the pg_class changes: what happens
on a standby? Maybe it just opens the door to unlocking that, or to
finding another option for keeping the information per table but
distinct per server. (Or we don't care, at least for a first
implementation, as with other parameters.)

That's a good point, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

#7 Robert Haas
robertmhaas@gmail.com
In reply to: Cédric Villemain (#6)
Re: cache estimates, cache access cost

On Thu, May 19, 2011 at 8:19 AM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:

2011/5/19 Robert Haas <robertmhaas@gmail.com>:

On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:

The point is to get the in-cache ratio, not the distribution of the
data in cache (pgfincore also lets you see that information).
I don't see how a stable system (a server in production) can have its
ratio moving up and down so fast without a known pattern.

Really?  It doesn't seem that hard to me.  For example, your nightly
reports might use a different set of tables than are active during the
day....

Yes, this is a known pattern; I believe we can work with it.

I guess the case where I agree that this would be relatively static is
on something like a busy OLTP system. If different users access
different portions of the main tables, which parts of each relation
are hot might move around, but overall the percentage of that relation
in cache probably won't move around a ton, except perhaps just after
running a one-off reporting query, or when the system is first
starting up.

But that's not everybody's workload. Imagine a system that is
relatively lightly used. Every once in a while someone comes along
and runs a big reporting query. Well, the contents of the buffer
caches might vary considerably depending on *which* big reporting
queries ran most recently.

Also, even if we knew what was going to be in cache at the start of
the query, the execution of the query might change things greatly as
it runs. For example, imagine a join between some table and itself.
If we estimate that none of the data is in cache, we will almost
certainly be wrong, because it's likely both sides of the join are
going to access some of the same pages. Exactly how many depends on
the details of the join condition and whether we choose to implement
it by merging, sorting, or hashing. But it's likely going to be more
than zero. This problem can also arise in other contexts - for
example, if a query accesses a bunch of large tables, the tables that
are accessed later in the computation might be less cached than the
ones accessed earlier in the computation, because the earlier accesses
pushed parts of the tables accessed later out of cache. Or, if a
query requires a large sort, and the value of work_mem is very high
(say 1GB), the sort might evict data from cache. Now maybe none of
this matters a bit in practice, but it's something to think about.

There was an interesting report on a problem along these lines from
Kevin Grittner a while back. He found he needed to set seq_page_cost
and random_page_cost differently for the database user that ran the
nightly reports, precisely because the degree of caching was very
different than it was for the daily activity, and he got bad plans
otherwise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8 Cédric Villemain
cedric.villemain.debian@gmail.com
In reply to: Robert Haas (#7)
Re: cache estimates, cache access cost

2011/5/19 Robert Haas <robertmhaas@gmail.com>:

On Thu, May 19, 2011 at 8:19 AM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:

2011/5/19 Robert Haas <robertmhaas@gmail.com>:

On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:

The point is to get the in-cache ratio, not the distribution of the
data in cache (pgfincore also lets you see that information).
I don't see how a stable system (a server in production) can have its
ratio moving up and down so fast without a known pattern.

Really?  It doesn't seem that hard to me.  For example, your nightly
reports might use a different set of tables than are active during the
day....

Yes, this is a known pattern; I believe we can work with it.

I guess the case where I agree that this would be relatively static is
on something like a busy OLTP system.  If different users access
different portions of the main tables, which parts of each relation
are hot might move around, but overall the percentage of that relation
in cache probably won't move around a ton, except perhaps just after
running a one-off reporting query, or when the system is first
starting up.

yes.

But that's not everybody's workload.  Imagine a system that is
relatively lightly used.  Every once in a while someone comes along
and runs a big reporting query.  Well, the contents of the buffer
caches are might vary considerably depending on *which* big reporting
queries ran most recently.

Yes, I agree. This scenario is the case where oscache_percent and
pgcache_percent are subject to change, I guess. We can define 1/
whether the values can/need to be changed, and 2/ when to update the
values. For 2/, database usage may help trigger an ANALYZE when
required. But to be honest, I'd like to hear more about the strategy
Greg suggests here.

Those scenarios are good to keep in mind for building good indicators,
both for the plugin doing the ANALYZE and for solving 2/.

Also, even if we knew what was going to be in cache at the start of
the query, the execution of the query might change things greatly as
it runs.  For example, imagine a join between some table and itself.
If we estimate that none of the data is in cache, we will almost
certainly be wrong, because it's likely both sides of the join are
going to access some of the same pages.  Exactly how many depends on
the details of the join condition and whether we choose to implement
it by merging, sorting, or hashing.  But it's likely going to be more
than zero.  This problem can also arise in other contexts - for
example, if a query accesses a bunch of large tables, the tables that
are accessed later in the computation might be less cached than the
ones accessed earlier in the computation, because the earlier accesses
pushed parts of the tables accessed later out of cache.

Yes, I believe the Mackert and Lohman formula has served well so far,
and I never suggested removing it.
It will need some rewriting to work with the new GUC and the new
pg_class columns, but the code is already in place for that.

Or, if a
query requires a large sort, and the value of work_mem is very high
(say 1GB), the sort might evict data from cache.  Now maybe none of
this matters a bit in practice, but it's something to think about.

Yes, I agree again.

There was an interesting report on a problem along these lines from
Kevin Grittner a while back.  He found he needed to set seq_page_cost
and random_page_cost differently for the database user that ran the
nightly reports, precisely because the degree of caching was very
different than it was for the daily activity, and he got bad plans
otherwise.

This is in fact a very interesting use case. I believe the same
strategy can be applied, updating cache_page_cost and pg_class.
But I would really like it if this closed that use case:
seq_page_cost, random_page_cost and cache_page_cost should not need to
be changed; they should be more 'hardware-dependent'. What would need
to change in such a case is instead the frequency of ANALYZE OSCACHE
(or the arbitrarily set values). That should let the planner and the
costsize functions work from accurate values and provide the best plan
(again, the cache estimation arising from the running query itself
remains in the hands of Mackert and Lohman).
OK, maybe the user will have to write an ANALYZE OSCACHE; between some
queries in his scenarios.

Maybe a good scenario to add to the performance farm? (Like the
others, but this one has the great virtue of being a production case.)

I'll write those scenarios up on a wiki page so they can be used to
review corner cases and possible issues (not now, it is late here).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

#9 Cédric Villemain
cedric.villemain.debian@gmail.com
In reply to: Greg Smith (#2)
[WIP] cache estimates, cache access cost

2011/5/16 Greg Smith <greg@2ndquadrant.com>:

Cédric Villemain wrote:

http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache

This rebases easily to make Cedric's changes move to the end; I just pushed
a version with that change to
https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone
wants a cleaner one to browse.  I've attached a patch too if that's more
your thing.

I'd recommend not getting too stuck on the particular hook Cédric has added
here to compute the cache estimate, which uses mmap and mincore to figure it
out.  It's possible to compute similar numbers, albeit less accurate, using
an approach similar to how pg_buffercache inspects things.  And I even once
wrote a background writer extension that collected this sort of data as it
was running the LRU scan anyway.  Discussions of this idea seem to focus on
how the "what's in the cache?" data is collected, which as far as I'm
concerned is the least important part.  There are multiple options, some
work better than others, and there's no reason that can't be swapped out
later.  The more important question is how to store the data collected and
then use it for optimizing queries.

Attached are updated patches, without the plugin itself. I've also
added the cache_page_cost GUC; unlike the other page costs, this one
is not settable per tablespace.

There are 6 patches:

0001-Add-reloscache-column-to-pg_class.patch
0002-Add-a-function-to-update-the-new-pg_class-cols.patch
0003-Add-ANALYZE-OSCACHE-VERBOSE-relation.patch
0004-Add-a-Hook-to-handle-OSCache-stats.patch
0005-Add-reloscache-to-Index-Rel-OptInfo.patch
0006-Add-cache_page_cost-GUC.patch

I have some comments on my own code:

* I am not sure of the best datatype to use for 'reloscache'.
* I didn't include the catalog version number change in the patch itself.
* oscache_update_relstats() is very similar to vac_update_relstats();
it might be better to merge them, but reloscache should not be updated
at the same time as the other stats.
* There is probably too much work done in do_oscache_analyze_rel()
because I kept vac_open_indexes() (not a big drama atm).
* I don't know gram.y very well, so I am not sure my changes cover all
cases.
* No tests; the similar columns and GUCs don't have tests either, but
a test for ANALYZE OSCACHE is missing.

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Attachments:

0001-Add-reloscache-column-to-pg_class.patch (text/x-patch, +58/-44)
0002-Add-a-function-to-update-the-new-pg_class-cols.patch (text/x-patch, +46/-1)
0003-Add-ANALYZE-OSCACHE-VERBOSE-relation.patch (text/x-patch, +165/-28)
0004-Add-a-Hook-to-handle-OSCache-stats.patch (text/x-patch, +17/-1)
0005-Add-reloscache-to-Index-Rel-OptInfo.patch (text/x-patch, +9/-1)
0006-Add-cache_page_cost-GUC.patch (text/x-patch, +27/-1)
#10 Robert Haas
robertmhaas@gmail.com
In reply to: Cédric Villemain (#9)
Re: [WIP] cache estimates, cache access cost

On Tue, Jun 14, 2011 at 10:29 AM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:

0001-Add-reloscache-column-to-pg_class.patch
0002-Add-a-function-to-update-the-new-pg_class-cols.patch
0003-Add-ANALYZE-OSCACHE-VERBOSE-relation.patch
0004-Add-a-Hook-to-handle-OSCache-stats.patch
0005-Add-reloscache-to-Index-Rel-OptInfo.patch
0006-Add-cache_page_cost-GUC.patch

It seems to me that posting updated versions of this patch gets us no
closer to addressing the concerns I (and Tom, on other threads)
expressed about this idea previously. Specifically:

1. ANALYZE happens far too infrequently to believe that any data taken
at ANALYZE time will still be relevant at execution time.
2. Using data gathered by ANALYZE will make plans less stable, and our
users complain not infrequently about the plan instability we already
have, therefore we should not add more.
3. Even if the data were accurate and did not cause plan instability, we
have no evidence that using it will improve real-world performance.

Now, it's possible that you or someone else could provide some
experimental evidence refuting these points. But right now there
isn't any, and until there is, -1 from me on applying any of this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11 Cédric Villemain
cedric.villemain.debian@gmail.com
In reply to: Robert Haas (#10)
Re: [WIP] cache estimates, cache access cost

2011/6/14 Robert Haas <robertmhaas@gmail.com>:

On Tue, Jun 14, 2011 at 10:29 AM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:

0001-Add-reloscache-column-to-pg_class.patch
0002-Add-a-function-to-update-the-new-pg_class-cols.patch
0003-Add-ANALYZE-OSCACHE-VERBOSE-relation.patch
0004-Add-a-Hook-to-handle-OSCache-stats.patch
0005-Add-reloscache-to-Index-Rel-OptInfo.patch
0006-Add-cache_page_cost-GUC.patch

It seems to me that posting updated versions of this patch gets us no
closer to addressing the concerns I (and Tom, on other threads)
expressed about this idea previously.  Specifically:

1. ANALYZE happens far too infrequently to believe that any data taken
at ANALYZE time will still be relevant at execution time.

ANALYZE happens when people execute it; otherwise it is autoanalyze,
and I am not providing an auto-analyze-oscache.
ANALYZE OSCACHE is just a very simple wrapper to update pg_class. The
frequency is not important here, I believe.

2. Using data gathered by ANALYZE will make plans less stable, and our
users complain not infrequently about the plan instability we already
have, therefore we should not add more.

Again, it is hard to do an UPDATE pg_class SET reloscache, so I used
the ANALYZE logic.
I have also taken into account that someone may want to SET the
values, as was likewise suggested, so my patches allow saying: 'this
table is 95% in cache, the DBA said' (which is stable, not based on
OS stats).

This case has been suggested several times and is covered by my patch.

3. Even if the data were accurate and did not cause plan instability, we
have no evidence that using it will improve real-world performance.

I have not finished my work on cost estimation; I believe that work
will take some time and can be done in another commitfest. At the
moment my patches do not change anything in the planner's decisions;
they just offer the tools I need to hack on cost estimates.

Now, it's possible that you or someone else could provide some
experimental evidence refuting these points.  But right now there
isn't any, and until there is, -1 from me on applying any of this.

I was trying to split the patch into groups of features to reduce its
size. The work is in progress.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

#12 Greg Smith
gsmith@gregsmith.com
In reply to: Robert Haas (#10)
Re: [WIP] cache estimates, cache access cost

On 06/14/2011 11:04 AM, Robert Haas wrote:

Even if the data were accurate and did not cause plan instability, we
have no evidence that using it will improve real-world performance.

That's the dependency Cédric has provided us a way to finally make
progress on. Everyone says there's no evidence that this whole approach
will improve performance. But we can't collect such data, to prove or
disprove that it helps, without a proof-of-concept patch that implements
*something*. You may not like the particular way the data is collected
here, but it's a working implementation that may be useful for some
people. I'll take "data collected at ANALYZE time" as a completely
reasonable way to populate the new structures with realistic enough test
data to use initially.

Surely at least one other way to populate the statistics, and possibly
multiple other ways that the user selects, will be needed eventually. I
commented a while ago on this thread: every one of these discussions
always gets dragged into the details of how the cache statistics data
will be collected and rejects whatever is suggested as not good enough.
Until that stops, no progress will ever get made on the higher level
details. By its nature, developing toward integrating cached
percentages is going to lurch forward on both "collecting the cache
data" and "using the cache knowledge in queries" fronts almost
independently. This is not a commit candidate; it's the first useful
proof of concept step for something we keep talking about but never
really doing.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

#13 Robert Haas
robertmhaas@gmail.com
In reply to: Greg Smith (#12)
Re: [WIP] cache estimates, cache access cost

On Tue, Jun 14, 2011 at 1:10 PM, Greg Smith <greg@2ndquadrant.com> wrote:

On 06/14/2011 11:04 AM, Robert Haas wrote:

Even if the data were accurate and did not cause plan instability, we
have no evidence that using it will improve real-world performance.

That's the dependency Cédric has provided us a way to finally make progress
on.  Everyone says there's no evidence that this whole approach will improve
performance.  But we can't collect such data, to prove or disprove that it
helps, without a proof-of-concept patch that implements *something*.  You may not
like the particular way the data is collected here, but it's a working
implementation that may be useful for some people.  I'll take "data
collected at ANALYZE time" as a completely reasonable way to populate the
new structures with realistic enough test data to use initially.

But there's no reason that code (which may or may not eventually prove
useful) has to be incorporated into the main tree. We don't commit
code so people can go benchmark it; we ask for the benchmarking to be
done first, and then if the results are favorable, we commit the code.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14 Robert Haas
robertmhaas@gmail.com
In reply to: Cédric Villemain (#11)
Re: [WIP] cache estimates, cache access cost

On Tue, Jun 14, 2011 at 12:06 PM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:

1. ANALYZE happens far too infrequently to believe that any data taken
at ANALYZE time will still be relevant at execution time.

ANALYZE happens when people execute it; otherwise it is autoanalyze,
and I am not providing an auto-analyze-oscache.
ANALYZE OSCACHE is just a very simple wrapper to update pg_class. The
frequency is not important here, I believe.

Well, I'm not saying you have to have all the answers to post a WIP
patch, certainly. But in terms of getting something committable, it
seems like we need to have at least an outline of what the long-term
plan is. If ANALYZE OSCACHE is an infrequent operation, then the data
isn't going to be a reliable guide to what will happen at execution
time...

2. Using data gathered by ANALYZE will make plans less stable, and our
users complain not infrequently about the plan instability we already
have, therefore we should not add more.

...and if it is a frequent operation then it's going to result in
unstable plans (and maybe pg_class bloat). There's a fundamental
tension here that I don't think you can just wave your hands at.

I was trying to split the patch into groups of features to reduce its
size. The work is in progress.

Totally reasonable, but I can't see committing any of it without some
evidence that there's light at the end of the tunnel. No performance
tests *whatsoever* have been done. We can debate the exact amount of
evidence that should be required to prove that something is useful
from a performance perspective, but we at least need some. I'm
beating on this point because I believe that the whole idea of trying
to feed this information back into the planner is going to turn out to
be something that we don't want to do. I think it's going to turn out
to have downsides that are far larger than the upsides. I am
completely willing to be be proven wrong, but right now I think this
will make things worse and you think it will make things better and I
don't see any way to bridge that gap without doing some measurements.

For example, if you run this patch on a system and subject that system
to a relatively even workload, how much do the numbers bounce around
between runs? What if you vary the workload, so that you blast it
with OLTP traffic at some times and then run reporting queries at
other times? Or different tables become hot at different times?

Once you've written code to make the planner do something with the
caching % values, then you can start to explore other questions. Can
you generate plan instability, especially on complex queries, which
are more prone to change quickly based on small changes in the cost
estimates? Can you demonstrate a workload where bad performance is
inevitable with the current code, but with your code, the system
becomes self-tuning and ends up with good performance? What happens
if you have a large cold table with a small hot end where all activity
is concentrated?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15 Cédric Villemain
cedric.villemain.debian@gmail.com
In reply to: Robert Haas (#14)
Re: [WIP] cache estimates, cache access cost

2011/6/14 Robert Haas <robertmhaas@gmail.com>:

On Tue, Jun 14, 2011 at 12:06 PM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:

1. ANALYZE happens far too infrequently to believe that any data taken
at ANALYZE time will still be relevant at execution time.

ANALYZE happens when people execute it; otherwise it is autoanalyze,
and I am not providing an auto-analyze-oscache.
ANALYZE OSCACHE is just a very simple wrapper to update pg_class. The
frequency is not important here, I believe.

Well, I'm not saying you have to have all the answers to post a WIP
patch, certainly.  But in terms of getting something committable, it
seems like we need to have at least an outline of what the long-term
plan is.  If ANALYZE OSCACHE is an infrequent operation, then the data
isn't going to be a reliable guide to what will happen at execution
time...

Ok.

2. Using data gathered by ANALYZE will make plans less stable, and our
users complain not infrequently about the plan instability we already
have, therefore we should not add more.

...and if it is a frequent operation then it's going to result in
unstable plans (and maybe pg_class bloat).  There's a fundamental
tension here that I don't think you can just wave your hands at.

I don't want to hide that point, which is simply correct.
The idea is not to have something that needs to be updated too often,
but it does need to be taken into account.

I was trying to split the patch into groups of features to reduce its
size. The work is in progress.

Totally reasonable, but I can't see committing any of it without some
evidence that there's light at the end of the tunnel.  No performance
tests *whatsoever* have been done.  We can debate the exact amount of
evidence that should be required to prove that something is useful
from a performance perspective, but we at least need some.  I'm
beating on this point because I believe that the whole idea of trying
to feed this information back into the planner is going to turn out to
be something that we don't want to do.  I think it's going to turn out
to have downsides that are far larger than the upsides.

It is possible, yes.
I try to make my changes such that if reloscache has its default
value, the planner keeps the same behavior as before.

 I am
completely willing to be proven wrong, but right now I think this
will make things worse and you think it will make things better and I
don't see any way to bridge that gap without doing some measurements.

Correct.

For example, if you run this patch on a system and subject that system
to a relatively even workload, how much do the numbers bounce around
between runs?  What if you vary the workload, so that you blast it
with OLTP traffic at some times and then run reporting queries at
other times?  Or different tables become hot at different times?

This is all true; it is *already* true.
Like the thread about random_page_cost vs. index_page_cost, where the
best option was to change the parameters at certain times of the day
(IIRC the use case).

I mean that I agree those benchmarks need to be done; hopefully I can
fix some use cases while not breaking others too much, or not at all,
or ...

Once you've written code to make the planner do something with the
caching % values, then you can start to explore other questions.  Can
you generate plan instability, especially on complex queries, which
are more prone to change quickly based on small changes in the cost
estimates?  Can you demonstrate a workload where bad performance is
inevitable with the current code, but with your code, the system

My next step is the cost estimation changes. I already have some very
small use cases where the minimal changes I've made so far look
interesting, but that is not enough to present as evidence.

becomes self-tuning and ends up with good performance?  What happens
if you have a large cold table with a small hot end where all activity
is concentrated?

We are at step 3 here :-) I already have some ideas for handling those
situations, but they are not yet polished.

The current idea is to be conservative, as PostgreSQL has always been; for example:

/*
 * disk and cache costs
 *
 * This assumes agnostic knowledge of the data distribution and query
 * usage: a large table may have a hot part of 10% that is the only
 * part requested, or we may select only (c)old data so that the cache
 * is useless. We keep the original strategy of not guessing too much
 * and simply weight the cost globally.
 */
run_cost += baserel->pages * (spc_seq_page_cost * (1 - baserel->oscache) +
                              cache_page_cost * baserel->oscache);

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

#16 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Cédric Villemain (#9)
Re: [WIP] cache estimates, cache access cost

Excerpts from Cédric Villemain's message of Tue Jun 14 10:29:36 -0400 2011:

Attached are updated patches, without the plugin itself. I've also
added the cache_page_cost GUC; unlike the other page costs, this one
is not settable per tablespace.

There are 6 patches:

0001-Add-reloscache-column-to-pg_class.patch

Hmm, do you really need this to be a new column? Would it work to have
it be a reloption?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#17 Cédric Villemain
cedric.villemain.debian@gmail.com
In reply to: Alvaro Herrera (#16)
Re: [WIP] cache estimates, cache access cost

2011/6/14 Alvaro Herrera <alvherre@commandprompt.com>:

Excerpts from Cédric Villemain's message of Tue Jun 14 10:29:36 -0400 2011:

Attached are updated patches, without the plugin itself. I've also
added the cache_page_cost GUC; unlike the other page costs, this one
is not settable per tablespace.

There are 6 patches:

0001-Add-reloscache-column-to-pg_class.patch

Hmm, do you really need this to be a new column?  Would it work to have
it be a reloption?

If we can have ALTER TABLE running under a heavy workload, why not.
I am a bit scared by the effect of such a reloption: it favors a
HINT-oriented strategy, when I would like to allow a dynamic strategy
driven by the server. That work is not done and may not work out, so a
reloption is good at least as a backup (and is more in line with the
idea suggested by Tom and others).

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

#18 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Cédric Villemain (#17)
Re: [WIP] cache estimates, cache access cost

Excerpts from Cédric Villemain's message of Tue Jun 14 17:10:20 -0400 2011:

If we can have ALTER TABLE running under a heavy workload, why not.
I am a bit scared by the effect of such a reloption: it favors a
HINT-oriented strategy, when I would like to allow a dynamic strategy
driven by the server. That work is not done and may not work out, so a
reloption is good at least as a backup (and is more in line with the
idea suggested by Tom and others).

Hmm, sounds like yet another use case for pg_class_nt. Why do these
keep popping up?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#19 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#16)
Re: [WIP] cache estimates, cache access cost

Alvaro Herrera <alvherre@commandprompt.com> writes:

Excerpts from Cédric Villemain's message of Tue Jun 14 10:29:36 -0400 2011:

0001-Add-reloscache-column-to-pg_class.patch

Hmm, do you really need this to be a new column? Would it work to have
it be a reloption?

If it's to be updated in the same way as ANALYZE updates reltuples and
relpages (ie, an in-place non-transactional update), I think it'll have
to be a real column.

regards, tom lane

#20 Greg Smith
gsmith@gregsmith.com
In reply to: Robert Haas (#13)
Re: [WIP] cache estimates, cache access cost

On 06/14/2011 01:16 PM, Robert Haas wrote:

But there's no reason that code (which may or may not eventually prove
useful) has to be incorporated into the main tree. We don't commit
code so people can go benchmark it; we ask for the benchmarking to be
done first, and then if the results are favorable, we commit the code.

Who said anything about this being a commit candidate? The "WIP" in the
subject says it's not intended to be. The community asks people to
submit design ideas early so that ideas around them can be explored
publicly. One of the things that needs to be explored, and that could
use some community feedback, is exactly how this should be benchmarked
in the first place. This topic--planning based on cached
percentage--keeps coming up, but hasn't gone very far as an abstract
discussion. Having a patch to test lets it turn to a concrete one.

Note that I already listed myself as the reviewer here, so it's not
even like this is asking explicitly for a community volunteer to help.
Would you like us to research this privately and then dump a giant patch
that is commit-candidate quality on everyone six months from now,
without anyone else getting input into the process, or would you like the
work to happen here? I recommended Cédric not even bother soliciting
ideas early, because I didn't want to get into this sort of debate. I
avoid sending anything here unless I already have a strong idea about
the solution, because it's hard to keep criticism at bay even with
that. He was more optimistic about working within the community
contribution guidelines and decided to send this over early instead. If
you feel this is too rough to even discuss, I'll mark it returned with
feedback and we'll go develop this ourselves.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

#21 Bruce Momjian
bruce@momjian.us
In reply to: Greg Smith (#20)
#22 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Smith (#20)
#23 Greg Smith
gsmith@gregsmith.com
In reply to: Tom Lane (#22)
#24 Robert Haas
robertmhaas@gmail.com
In reply to: Greg Smith (#20)
#25 Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#10)
#26 Cédric Villemain
cedric.villemain.debian@gmail.com
In reply to: Bruce Momjian (#25)
#27 Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#25)
#28 Cédric Villemain
cedric.villemain.debian@gmail.com
In reply to: Robert Haas (#27)
#29 Greg Smith
gsmith@gregsmith.com
In reply to: Bruce Momjian (#25)
#30 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Greg Smith (#29)
#31 Greg Smith
gsmith@gregsmith.com
In reply to: Kevin Grittner (#30)
#32 Robert Haas
robertmhaas@gmail.com
In reply to: Cédric Villemain (#28)
#33 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Greg Smith (#31)
#34 Greg Smith
gsmith@gregsmith.com
In reply to: Kevin Grittner (#33)
#35 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Greg Smith (#34)