Idea for the statistics collector

Started by Martijn van Oosterhout over 23 years ago, 23 messages
#1 Martijn van Oosterhout
kleptog@svana.org

Since it's currently all for collecting statistics on tables, why can't it
collect another type of statistic, like:

- How often the estimator gets it wrong?

At the end of an index scan, the executor could compare the number of rows
returned against what was estimated, and if it falls outside a certain
range, flag it.

Also, the average ratio of rows coming out of a distinct node vs the number
going in.

For a join clause, the amount of correlation between two columns (hard).

etc

Ideally, the planner could then use this info to make better plans.
Eventually, the whole system could become somewhat self-tuning.
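For concreteness, EXPLAIN ANALYZE already shows the two numbers side by side
for a single query; a minimal sketch, assuming a hypothetical table t with an
index on col:

    -- The planner's estimate (rows=...) and the executor's count
    -- (actual ... rows=...) for one query, on a hypothetical table t;
    -- the numbers shown are illustrative:
    EXPLAIN ANALYZE SELECT * FROM t WHERE col = 42;
    --  Index Scan using t_col_idx on t
    --      (cost=0.00..8.27 rows=1 width=36)
    --      (actual time=0.04..3.12 rows=1200 loops=1)
    -- The gap between rows=1 (estimated) and rows=1200 (actual) is
    -- exactly what the collector would count and flag.

The proposal is essentially to gather that comparison automatically, for
every query, instead of by hand.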

Does anyone see any problems with this?
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

There are 10 kinds of people in the world, those that can do binary
arithmetic and those that can't.

#2 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Martijn van Oosterhout (#1)
Re: [GENERAL] Idea for the statistics collector

Martijn van Oosterhout wrote:

Since it's currently all for collecting statistics on tables, why can't it
collect another type of statistic, like:

- How often the estimator gets it wrong?

At the end of an index scan, the executor could compare the number of rows
returned against what was estimated, and if it falls outside a certain
range, flag it.

Also, the average ratio of rows coming out of a distinct node vs the number
going in.

For a join clause, the amount of correlation between two columns (hard).

etc

Ideally, the planner could then use this info to make better plans.
Eventually, the whole system could become somewhat self-tuning.

Does anyone see any problems with this?

[ Discussion moved to hackers.]

I have thought that some type of feedback from the executor back into
the optimizer would be a good feature. Not sure how to do it, but your
idea makes sense. It certainly could update the table statistics after
a sequential scan.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#3 Neil Conway
nconway@klamath.dyndns.org
In reply to: Bruce Momjian (#2)
Re: [GENERAL] Idea for the statistics collector

On Thu, 20 Jun 2002 22:50:04 -0400 (EDT)
"Bruce Momjian" <pgman@candle.pha.pa.us> wrote:

I have thought that some type of feedback from the executor back into
the optimizer would be a good feature. Not sure how to do it, but your
idea makes sense. It certainly could update the table statistics after
a sequential scan.

Search the archives for a thread I started on -hackers called "self-tuning
histograms", which talks about a pretty similar idea. The technique there
applies only to histograms, and builds the histogram based *only* upon
the data provided by the executor.

Tom commented that it's probably a better idea to concentrate on more
elementary techniques, like multi-dimensional histograms, before starting
on ST histograms. I agree, and plan to look at multi-dimensional histograms
when I get some spare time.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC

#4 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Neil Conway (#3)
Re: [GENERAL] Idea for the statistics collector

Neil Conway wrote:

On Thu, 20 Jun 2002 22:50:04 -0400 (EDT)
"Bruce Momjian" <pgman@candle.pha.pa.us> wrote:

I have thought that some type of feedback from the executor back into
the optimizer would be a good feature. Not sure how to do it, but your
idea makes sense. It certainly could update the table statistics after
a sequential scan.

Search the archives for a thread I started on -hackers called "self-tuning
histograms", which talks about a pretty similar idea. The technique there
applies only to histograms, and builds the histogram based *only* upon
the data provided by the executor.

Tom commented that it's probably a better idea to concentrate on more
elementary techniques, like multi-dimensional histograms, before starting
on ST histograms. I agree, and plan to look at multi-dimensional histograms
when I get some spare time.

I was thinking of something much more elementary, like a table whose
statistics claim 50 blocks while an executor sequential scan shows 500
blocks.
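
For instance, a minimal sketch, assuming a hypothetical table t: the
planner's idea of the table's size lives in pg_class and is refreshed only
by VACUUM/ANALYZE.

    -- What the planner currently believes about a table's size;
    -- relpages/reltuples go stale between VACUUM/ANALYZE runs:
    SELECT relpages, reltuples FROM pg_class WHERE relname = 't';
    -- If a sequential scan actually reads ten times relpages, the
    -- executor could write the true figure back.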

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#5 Dann Corbit
DCorbit@connx.com
In reply to: Bruce Momjian (#4)
Re: [GENERAL] Idea for the statistics collector

Here are some class notes that contain some very good ideas with
terrific explanations:
http://www.cs.duke.edu/education/courses/fall01/cps216/

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#1)
Re: Idea for the statistics collector

Martijn van Oosterhout <kleptog@svana.org> writes:

Since it's currently all for collecting statistics on tables, why can't it
collect another type of statistic, like:
- How often the estimator gets it wrong?
[snip]
Does anyone see any problems with this?

(1) forced overhead on *every* query.
(2) contention to update the same rows of pg_statistic (or wherever you
plan to store this info).
(3) okay, so the estimate was wrong; exactly which of the many
parameters that went into the estimate do you plan to twiddle?
What if it's not the parameter values that are at fault, but the
cost-model equations themselves?

Closed-loop feedback is a great thing when you understand the dynamics
of the system you intend to apply feedback control to. When you don't,
it's a great way to shoot yourself in the foot. Unfortunately I don't
think the PG optimizer falls in the first category at present.

regards, tom lane

#7 Martijn van Oosterhout
kleptog@svana.org
In reply to: Tom Lane (#6)
Re: Idea for the statistics collector

On Fri, Jun 21, 2002 at 12:47:18AM -0400, Tom Lane wrote:

Martijn van Oosterhout <kleptog@svana.org> writes:

Since it's currently all for collecting statistics on tables, why can't it
collect another type of statistic, like:
- How often the estimator gets it wrong?
[snip]
Does anyone see any problems with this?

(1) forced overhead on *every* query.

If you don't want it, don't use it. The current statistics have the same
issue, and you can choose not to collect those as well.

(2) contention to update the same rows of pg_statistic (or wherever you
plan to store this info).

True, can't avoid that. It depends on how many queries you run. Maybe only
enable it for specific sessions?

(3) okay, so the estimate was wrong; exactly which of the many
parameters that went into the estimate do you plan to twiddle?
What if it's not the parameter values that are at fault, but the
cost-model equations themselves?

Firstly, I was only thinking of going for the basic nodes (Index Scan, Seq
Scan, Distinct). Other types have far more variables. Secondly, even if you
only count, it's useful. For example, if it tells you that the planner is
off by a factor of 10 more than 75% of the time, that's useful information
independent of what the actual variables are.
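
To sketch what pure counting could look like, suppose the executor appended
one row per scan to a hypothetical log table plan_log(estimated_rows,
actual_rows); the factor-of-10 figure is then a single aggregate:

    -- Fraction of scans whose estimate was off by at least 10x, over
    -- a hypothetical plan_log(estimated_rows, actual_rows) table:
    SELECT avg(CASE WHEN greatest(estimated_rows, actual_rows)
                  >= 10 * greatest(least(estimated_rows, actual_rows), 1)
               THEN 1.0 ELSE 0.0 END) AS frac_off_by_10x
      FROM plan_log;
    -- Zero rows on either side is treated as 1 to keep the ratio finite.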

Closed-loop feedback is a great thing when you understand the dynamics
of the system you intend to apply feedback control to. When you don't,
it's a great way to shoot yourself in the foot. Unfortunately I don't
think the PG optimizer falls in the first category at present.

Using the results for planning is obviously a tricky area and should be
approached with caution. But just collecting the statistics shouldn't be too
bad?

See also -hackers.
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

There are 10 kinds of people in the world, those that can do binary
arithmetic and those that can't.

#8 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Martijn van Oosterhout (#7)
Re: Idea for the statistics collector

Martijn van Oosterhout wrote:

Firstly, I was only thinking of going for the basic nodes (Index Scan, Seq
Scan, Distinct). Other types have far more variables. Secondly, even if you
only count, it's useful. For example, if it tells you that the planner is
off by a factor of 10 more than 75% of the time, that's useful information
independent of what the actual variables are.

Yes, only updating the stats if the estimate was off by a factor of 10
or so should cut down on the overhead.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#8)
Re: Idea for the statistics collector

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Martijn van Oosterhout wrote:

Firstly, I was only thinking of going for the basic nodes (Index Scan, Seq
Scan, Distinct). Other types have far more variables. Secondly, even if you
only count, it's useful. For example, if it tells you that the planner is
off by a factor of 10 more than 75% of the time, that's useful information
independent of what the actual variables are.

Yes, only updating the stats if the estimate was off by a factor of 10
or so should cut down on the overhead.

And reduce the usefulness even more ;-). As a pure stats-gathering
exercise it might be worth doing, but not if you only log the failure
cases. How will you know how well you are doing if you take a
biased-by-design sample?

regards, tom lane

#10 Jan Wieck
JanWieck@Yahoo.com
In reply to: Bruce Momjian (#8)
Re: Idea for the statistics collector

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Martijn van Oosterhout wrote:

Firstly, I was only thinking of going for the basic nodes (Index Scan, Seq
Scan, Distinct). Other types have far more variables. Secondly, even if you
only count, it's useful. For example, if it tells you that the planner is
off by a factor of 10 more than 75% of the time, that's useful information
independent of what the actual variables are.

Yes, only updating the stats if the estimate was off by a factor of 10
or so should cut down on the overhead.

And reduce the usefulness even more ;-). As a pure stats-gathering
exercise it might be worth doing, but not if you only log the failure
cases. How will you know how well you are doing if you take a
biased-by-design sample?

Sure, it is required to count all cases, success and failure. But I don't
see why that information must constantly be fed back into the statistics
tables. As long as we don't restart, it is perfectly fine in the collector.
And it need not be fed back to the backend on every query.

Maybe ANALYZE would like to have some of that information? If memory
serves, ANALYZE does a poor job when the data isn't well distributed, has
few distinct values, and the like. That causes wrong estimates (among other
things, of course). The idea would be to have ANALYZE take a much closer
look at tables with horrible estimates, to generate better information for
those.
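
A sketch of how that could work, reusing the hypothetical plan_log from
earlier with a relname column added: single out the tables whose estimates
are routinely terrible, then sample them harder.

    -- Tables where more than half the scans were off by 10x or more,
    -- over a hypothetical plan_log(relname, estimated_rows, actual_rows):
    SELECT relname
      FROM plan_log
     GROUP BY relname
    HAVING avg(CASE WHEN greatest(estimated_rows, actual_rows)
                 >= 10 * greatest(least(estimated_rows, actual_rows), 1)
               THEN 1.0 ELSE 0.0 END) > 0.5;
    -- ANALYZE could then give those tables a higher per-column
    -- statistics target before resampling.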

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#11 Doug Fields
dfields@pexicom.com
In reply to: Jan Wieck (#10)
Re: Idea for the statistics collector

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Martijn van Oosterhout wrote:

Firstly, I was only thinking of going for the basic nodes (Index Scan, Seq
Scan, Distinct). Other types have far more variables. Secondly, even if you
only count, it's useful. For example, if it tells you that the planner is
off by a factor of 10 more than 75% of the time, that's useful information
independent of what the actual variables are.

And reduce the usefulness even more ;-). As a pure stats-gathering
exercise it might be worth doing, but not if you only log the failure
cases. How will you know how well you are doing if you take a
biased-by-design sample?

Personally, given that at least once or twice a day someone asks about
performance or "why isn't my index being used" and similar, I think doing
this would be a great idea.

Perhaps not necessarily in the full-fledged way, but create a sort of
"ANALYZE log" that records the optimizer's estimate and the actual result
for every query. This, of course, could be enabled or disabled on a
per-connection basis, a per-table basis (like OIDs), or whatever other basis
makes life easiest for the developers.

Then, when the next ANALYZE is run, it could do its usual analysis, and
apply some additional heuristics based upon what it learns from the
"ANALYZE log," possibly to do several things:

1) Automatically increase/decrease the SET STATISTICS target used by
ANALYZE, for example increasing it as a table grows larger and the
"randomness" grows less than linearly with size (e.g., if you have 50 or 60
groups in a 1,000,000-row table, that certainly needs a higher SET
STATISTICS target; I do this on my tables). See the sketch after this list.
2) Have an additional value in the statistics table called the
"index_heuristic" or "random_page_adjustment_heuristic", which does nothing
when it is 1 but otherwise scales the cost of an index scan versus a
sequential scan by that factor; never change it by more than a few percent
per ANALYZE.
3) Flag really bad query estimates in a second log (maybe the regular log):
let it analyze the queries and flag anything two or three standard
deviations out.
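
The manual form of item 1, as a minimal sketch assuming a hypothetical
table t with a skewed column grp:

    -- Raise the per-column statistics target so the next ANALYZE
    -- samples more rows and keeps more histogram detail:
    ALTER TABLE t ALTER COLUMN grp SET STATISTICS 500;
    ANALYZE t;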

Now, I suggest all this stuff in the name of usability and
self-maintainability. Unfortunately, I don't have the wherewithal to
actually assist in development.

Another possibility is to default "use_seq_scan" to OFF, or whatever the
parameter is (I did my optimizing a while ago so it's fading), so that if
there's an index, it will use it regardless, as this seems to be what the
great majority of people expect to happen. And/or add this to a FAQ, and
let us all reply "see http://.../indexfaq.html." :)

Cheers,

Doug

#12 Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Bruce Momjian (#2)
Re: [GENERAL] Idea for the statistics collector

I was thinking of writing a command line tool like 'pgtune' that looks at
the stats views and generates SQL for, or automatically performs, the
following:

* Dropping indices that are never used
* Creating appropriate indices to avoid large, expensive sequential scans.

This would put us in the 'mysql makes my indices for me by magic' league -
but would be far more powerful and flexible. How to do multikey indices is
beyond me tho.
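
The 'never used' half is already close to a one-liner against the stats
views; a sketch, assuming the statistics collector has been running long
enough to be representative:

    -- Indexes the statistics collector has never seen used:
    SELECT schemaname, relname, indexrelname
      FROM pg_stat_user_indexes
     WHERE idx_scan = 0;
    -- Candidates only: constraint-enforcing indexes and indexes used
    -- solely by rare reports will also show idx_scan = 0.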

*sigh* I'm recovering from a septoplasty on my nose atm, so I might have
some time to do some coding!

Chris

----- Original Message -----
From: "Bruce Momjian" <pgman@candle.pha.pa.us>
To: "Martijn van Oosterhout" <kleptog@svana.org>
Cc: "PostgreSQL-development" <pgsql-hackers@postgresql.org>
Sent: Friday, June 21, 2002 10:50 AM
Subject: Re: [HACKERS] [GENERAL] Idea for the statistics collector

Martijn van Oosterhout wrote:

Since it's currently all for collecting statistics on tables, why can't it
collect another type of statistic, like:

- How often the estimator gets it wrong?

At the end of an index scan, the executor could compare the number of rows
returned against what was estimated, and if it falls outside a certain
range, flag it.

Also, the average ratio of rows coming out of a distinct node vs the number
going in.

For a join clause, the amount of correlation between two columns (hard).

etc

Ideally, the planner could then use this info to make better plans.
Eventually, the whole system could become somewhat self-tuning.

Does anyone see any problems with this?

[ Discussion moved to hackers.]

I have thought that some type of feedback from the executor back into
the optimizer would be a good feature. Not sure how to do it, but your
idea makes sense. It certainly could update the table statistics after
a sequential scan.

--
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 853-3000
+  If your life is a hard drive,     |  830 Blythe Avenue
+  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


#13 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Christopher Kings-Lynne (#12)
Re: [GENERAL] Idea for the statistics collector

Christopher Kings-Lynne wrote:

I was thinking of writing a command line tool like 'pgtune' that looks at
the stats views and generates SQL for, or automatically performs, the
following:

* Dropping indices that are never used
* Creating appropriate indices to avoid large, expensive sequential scans.

This would put us in the 'mysql makes my indices for me by magic' league -
but would be far more powerful and flexible. How to do multikey indices is
beyond me tho.

This is a great idea. I have been wanting to do something like this
myself but probably won't get the time.

Does MySQL really make indexes by magic?

Also, I had to look up the contraction for "will not" because I always
get that confused (won't). I just found a web page on it:

http://www.straightdope.com/mailbag/mwont.html

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christopher Kings-Lynne (#12)
Re: [GENERAL] Idea for the statistics collector

"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes:

I was thinking of writing a command line tool like 'pgtune' that looks at
the stats views and will generate SQL code for, or do automatically the
following:

* Dropping indices that are never used
* Creating appropriate indices to avoid large, expensive sequential scans.

Dropping unused indices sounds good --- but beware of dropping unique
indexes; they may be there to enforce a constraint, and not because of
any desire to use them in queries.

I'm not sure how you're going to automatically intuit appropriate
indexes to add, though. You'd need to look at a suitable workload
(ie, a representative set of queries) which is not data that's readily
available from the stats views. Perhaps we could expect the DBA to
provide a segment of log output that includes debug_print_query
and show_query_stats results.
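
A sketch of the unique-index caveat, joining the stats view against
pg_index so constraint-enforcing indexes are never suggested for dropping:

    -- Unused indexes, excluding unique ones, which may exist purely
    -- to enforce a constraint:
    SELECT s.schemaname, s.relname, s.indexrelname
      FROM pg_stat_user_indexes s
      JOIN pg_index i ON i.indexrelid = s.indexrelid
     WHERE s.idx_scan = 0
       AND NOT i.indisunique;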

regards, tom lane

#15 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Doug Fields (#11)
Re: Idea for the statistics collector

Added to TODO list:

* Log queries where the optimizer row estimates were dramatically
different from the number of rows actually found (?)

---------------------------------------------------------------------------

Doug Fields wrote:

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Martijn van Oosterhout wrote:

Firstly, I was only thinking of going for the basic nodes (Index Scan, Seq
Scan, Distinct). Other types have far more variables. Secondly, even if you
only count, it's useful. For example, if it tells you that the planner is
off by a factor of 10 more than 75% of the time, that's useful information
independent of what the actual variables are.

And reduce the usefulness even more ;-). As a pure stats-gathering
exercise it might be worth doing, but not if you only log the failure
cases. How will you know how well you are doing if you take a
biased-by-design sample?

Personally, given that at least once or twice a day someone asks about
performance or "why isn't my index being used" and similar, I think doing
this would be a great idea.

Perhaps not necessarily in the full-fledged way, but create a sort of
"ANALYZE log" that records the optimizer's estimate and the actual result
for every query. This, of course, could be enabled or disabled on a
per-connection basis, a per-table basis (like OIDs), or whatever other basis
makes life easiest for the developers.

Then, when the next ANALYZE is run, it could do its usual analysis, and
apply some additional heuristics based upon what it learns from the
"ANALYZE log," possibly to do several things:

1) Automatically increase/decrease the SET STATISTICS target used by
ANALYZE, for example increasing it as a table grows larger and the
"randomness" grows less than linearly with size (e.g., if you have 50 or 60
groups in a 1,000,000-row table, that certainly needs a higher SET
STATISTICS target; I do this on my tables).
2) Have an additional value in the statistics table called the
"index_heuristic" or "random_page_adjustment_heuristic", which does nothing
when it is 1 but otherwise scales the cost of an index scan versus a
sequential scan by that factor; never change it by more than a few percent
per ANALYZE.
3) Flag really bad query estimates in a second log (maybe the regular log):
let it analyze the queries and flag anything two or three standard
deviations out.

Now, I suggest all this stuff in the name of usability and
self-maintainability. Unfortunately, I don't have the wherewithal to
actually assist in development.

Another possibility is to default "use_seq_scan" to OFF, or whatever the
parameter is (I did my optimizing a while ago so it's fading), so that if
there's an index, it will use it regardless, as this seems to be what the
great majority of people expect to happen. And/or add this to a FAQ, and
let us all reply "see http://.../indexfaq.html." :)

Cheers,

Doug


-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#16 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Christopher Kings-Lynne (#12)
Re: [GENERAL] Idea for the statistics collector

Added to TODO:

* Add tool to query pg_stat_* tables and report indexes that aren't needed
or tables that might need indexes

---------------------------------------------------------------------------

Christopher Kings-Lynne wrote:

I was thinking of writing a command line tool like 'pgtune' that looks at
the stats views and generates SQL for, or automatically performs, the
following:

* Dropping indices that are never used
* Creating appropriate indices to avoid large, expensive sequential scans.

This would put us in the 'mysql makes my indices for me by magic' league -
but would be far more powerful and flexible. How to do multikey indices is
beyond me tho.

*sigh* I'm recovering from a septoplasty on my nose atm, so I might have
some time to do some coding!

Chris

----- Original Message -----
From: "Bruce Momjian" <pgman@candle.pha.pa.us>
To: "Martijn van Oosterhout" <kleptog@svana.org>
Cc: "PostgreSQL-development" <pgsql-hackers@postgresql.org>
Sent: Friday, June 21, 2002 10:50 AM
Subject: Re: [HACKERS] [GENERAL] Idea for the statistics collector

Martijn van Oosterhout wrote:

Since it's currently all for collecting statistics on tables, why can't it
collect another type of statistic, like:

- How often the estimator gets it wrong?

At the end of an index scan, the executor could compare the number of rows
returned against what was estimated, and if it falls outside a certain
range, flag it.

Also, the average ratio of rows coming out of a distinct node vs the number
going in.

For a join clause, the amount of correlation between two columns (hard).

etc

Ideally, the planner could then use this info to make better plans.
Eventually, the whole system could become somewhat self-tuning.

Does anyone see any problems with this?

[ Discussion moved to hackers.]

I have thought that some type of feedback from the executor back into
the optimizer would be a good feature. Not sure how to do it, but your
idea makes sense. It certainly could update the table statistics after
a sequential scan.

--
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 853-3000
+  If your life is a hard drive,     |  830 Blythe Avenue
+  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#17 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Bruce Momjian (#16)
Re: [GENERAL] Idea for the statistics collector

Christopher Kings-Lynne wrote:

Good god - how old was that email? 2002???

Yep, and been in my mailbox since then, waiting for me to process it
into a TODO entry.

---------------------------------------------------------------------------

Chris

Bruce Momjian wrote:

Added to TODO:

* Add tool to query pg_stat_* tables and report indexes that aren't needed
or tables that might need indexes

---------------------------------------------------------------------------

Christopher Kings-Lynne wrote:

I was thinking of writing a command line tool like 'pgtune' that looks at
the stats views and generates SQL for, or automatically performs, the
following:

* Dropping indices that are never used
* Creating appropriate indices to avoid large, expensive sequential scans.

This would put us in the 'mysql makes my indices for me by magic' league -
but would be far more powerful and flexible. How to do multikey indices is
beyond me tho.

*sigh* I'm recovering from a septoplasty on my nose atm, so I might have
some time to do some coding!

Chris

----- Original Message -----
From: "Bruce Momjian" <pgman@candle.pha.pa.us>
To: "Martijn van Oosterhout" <kleptog@svana.org>
Cc: "PostgreSQL-development" <pgsql-hackers@postgresql.org>
Sent: Friday, June 21, 2002 10:50 AM
Subject: Re: [HACKERS] [GENERAL] Idea for the statistics collector

Martijn van Oosterhout wrote:

Since it's currently all for collecting statistics on tables, why can't it
collect another type of statistic, like:

- How often the estimator gets it wrong?

At the end of an index scan, the executor could compare the number of rows
returned against what was estimated, and if it falls outside a certain
range, flag it.

Also, the average ratio of rows coming out of a distinct node vs the number
going in.

For a join clause, the amount of correlation between two columns (hard).

etc

Ideally, the planner could then use this info to make better plans.
Eventually, the whole system could become somewhat self-tuning.

Does anyone see any problems with this?

[ Discussion moved to hackers.]

I have thought that some type of feedback from the executor back into
the optimizer would be a good feature. Not sure how to do it, but your
idea makes sense. It certainly could update the table statistics after
a sequential scan.

--
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 853-3000
+  If your life is a hard drive,     |  830 Blythe Avenue
+  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#18 Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Bruce Momjian (#16)
Re: [GENERAL] Idea for the statistics collector

Good god - how old was that email? 2002???

Chris

Bruce Momjian wrote:

Added to TODO:

* Add tool to query pg_stat_* tables and report indexes that aren't needed
or tables that might need indexes

---------------------------------------------------------------------------

Christopher Kings-Lynne wrote:

I was thinking of writing a command line tool like 'pgtune' that looks at
the stats views and generates SQL for, or automatically performs, the
following:

* Dropping indices that are never used
* Creating appropriate indices to avoid large, expensive sequential scans.

This would put us in the 'mysql makes my indices for me by magic' league -
but would be far more powerful and flexible. How to do multikey indices is
beyond me tho.

*sigh* I'm recovering from a septoplasty on my nose atm, so I might have
some time to do some coding!

Chris

----- Original Message -----
From: "Bruce Momjian" <pgman@candle.pha.pa.us>
To: "Martijn van Oosterhout" <kleptog@svana.org>
Cc: "PostgreSQL-development" <pgsql-hackers@postgresql.org>
Sent: Friday, June 21, 2002 10:50 AM
Subject: Re: [HACKERS] [GENERAL] Idea for the statistics collector

Martijn van Oosterhout wrote:

Since it's currently all for collecting statistics on tables, why can't it
collect another type of statistic, like:

- How often the estimator gets it wrong?

At the end of an index scan, the executor could compare the number of rows
returned against what was estimated, and if it falls outside a certain
range, flag it.

Also, the average ratio of rows coming out of a distinct node vs the number
going in.

For a join clause, the amount of correlation between two columns (hard).

etc

Ideally, the planner could then use this info to make better plans.
Eventually, the whole system could become somewhat self-tuning.

Does anyone see any problems with this?

[ Discussion moved to hackers.]

I have thought that some type of feedback from the executor back into
the optimizer would be a good feature. Not sure how to do it, but your
idea makes sense. It certainly could update the table statistics after
a sequential scan.

--
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 853-3000
+  If your life is a hard drive,     |  830 Blythe Avenue
+  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


#19 Andreas Pflug
pgadmin@pse-consulting.de
In reply to: Bruce Momjian (#17)
Re: [GENERAL] Idea for the statistics collector

Bruce Momjian wrote:

Christopher Kings-Lynne wrote:

Good god - how old was that email? 2002???

Yep, and been in my mailbox since then, waiting for me to process it
into a TODO entry.

Exciting what one can find wiping the floor of the mailbox :-)

Regards,
Andreas

#20 Josh Berkus
josh@agliodbs.com
In reply to: Andreas Pflug (#19)
Re: [GENERAL] Idea for the statistics collector

Bruce,

Good god - how old was that email? 2002???

Yep, and been in my mailbox since then, waiting for me to process it
into a TODO entry.

    * Add tool to query pg_stat_* tables and report indexes that aren't
needed or tables that might need indexes

pg_stat_user_indexes and pg_stat_user_tables already give us this information.
Further, the newsysviews are going to fill in for the remaining missing info
(like unindexed FKs).
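
For example, the "might need an index" side falls out of
pg_stat_user_tables; a sketch with arbitrary illustrative thresholds:

    -- Tables that are sequentially scanned a lot and read many rows
    -- per scan (thresholds are arbitrary illustrations):
    SELECT relname, seq_scan, seq_tup_read, idx_scan
      FROM pg_stat_user_tables
     WHERE seq_scan > 1000
       AND seq_tup_read > seq_scan * 10000
     ORDER BY seq_tup_read DESC;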

What other tools do we need? If we're talking about an interactive analysis
tool a la pg_suggest, isn't that being worked on .... as an *add-in*? I
don't see any place for this in the core ...

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

#21 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Josh Berkus (#20)
Re: [GENERAL] Idea for the statistics collector

Josh Berkus wrote:

Bruce,

Good god - how old was that email? 2002???

Yep, and been in my mailbox since then, waiting for me to process it
into a TODO entry.

    * Add tool to query pg_stat_* tables and report indexes that aren't
needed or tables that might need indexes

pg_stat_user_indexes and pg_stat_user_tables already give us this information.
Further, the newsysviews are going to fill in for the remaining missing info
(like unindexed FKs).

What other tools do we need? If we're talking an interactive analysis tool
ala pg_suggest, isn't that being worked on .... as an *add-in*? I don't see
any place for this in the core ...

It is something we need, either in core or out. Is it done enough to
remove the item?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#22 Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Bruce Momjian (#21)
Re: [GENERAL] Idea for the statistics collector

What other tools do we need? If we're talking an interactive analysis tool
ala pg_suggest, isn't that being worked on .... as an *add-in*? I don't see
any place for this in the core ...

It is something we need, either in core or out. Is it done enough to
remove the item?

Well, the pg_advisor views on pgfoundry.org are supposed to provide this.
I'll probably ask Fabien to give me commits on it so I can put stuff
in that :)

Chris

#23 Josh Berkus
josh@agliodbs.com
In reply to: Christopher Kings-Lynne (#22)
Re: [GENERAL] Idea for the statistics collector

Chris,

Well the pg_advisor views in pgfoundry.org are supposed to provide this.
I'll probably ask Fabien to give me commits on it so i can put stuff
in that :)

If someone is still working on pg_advisor, suggest that they check out our
work in newsysviews. I think it'll speed things up considerably if they/you
build on our shoulders.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco