raising the default default_statistics_target
From time to time, people on IRC ask for help with performance
problems, and the cause of the difficulty is ultimately traced to a
poor query plan that is chosen because default_statistics_target is
too low. While there will always need to be *some* tuning of the
statistics target by advanced users, I wanted to see what the
performance penalty would be to increase the default stats target out
of the box -- that way, more users will get good query plans without
needing to manually tweak the configuration.
In the simple test I performed, raising the default_statistics_target
from 10 to 25 resulted in a 40% increase in the time to ANALYZE a
large table. (I picked 25 more or less at random -- would 15 or 20 be
better?)
That's a larger hit than I was hoping to see; however, it can be
argued that ANALYZE isn't really performance-critical anyway (since it
doesn't hold the same kind of locks that VACUUM and especially VACUUM
FULL hold). Also, I only have anecdotal evidence that this is actually
a problem. It may also be the case that for those people for whom 10
is an insufficient stats target, 25 is also insufficient.
Any comments on whether increasing the default stats target is a good
idea for 7.5? (Details on the test I performed are included below)
-Neil
I created a 2.1 GB table with 3 columns (int, varchar, and float):
nconway=# select relpages from pg_class where relname = 'abc';
relpages
----------
279621
(1 row)
nconway=# select reltuples from pg_class where relname = 'abc';
reltuples
-------------
3.35545e+07
(1 row)
I tested two default_statistics_target settings: 10 (the current
default), and 25. The test machine is a P4 1.8 GHz with 768 MB of RAM
and a pretty mediocre 7200 RPM IDE disk running Linux 2.6.3. I
rebooted the machine before and between tests.
ANALYZE w/ stats target = 10: 51.643 seconds
ANALYZE w/ stats target = 25: 71.969 seconds
(Additional tests performed w/o rebooting seem to be consistent with
these numbers.)
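For reference, each run boils down to something like this (a sketch of the
procedure, not a verbatim transcript; \timing is psql's elapsed-time display,
and abc is the test table described above):

  \timing
  SET default_statistics_target = 10;   -- 25 for the second run
  ANALYZE abc;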
Neil Conway <neilc@samurai.com> writes:
Any comments on whether increasing the default stats target is a good
idea for 7.5? (Details on the test I performed are included below)
This is something we need to consider, but we'll need more evidence
before making a choice. One thing that we have very little data about
is how much difference it makes in the quality of planner choices.
(There's no point in slowing down ANALYZE unless the plans get better.)
Also, I would expect that larger stats targets would slow down the parts
of the planner that look at the stats, since there are more data values
to examine. I do not have any numbers about this cost though --- do you
want to try to get some?
regards, tom lane
Tom Lane wrote:
This is something we need to consider, but we'll need more evidence
before making a choice. One thing that we have very little data about
is how much difference it makes in the quality of planner choices.
Right, but is there a practical way to actually get this data?
If the distribution of data in the table is irregular, a higher stats
target is needed to allow good planning choices. Therefore, the effect
that the stats target has on planner choices depends on the regularity
of the distribution of data at installations, and there is no way to
know that in general AFAICS.
Also, I would expect that larger stats targets would slow down the parts
of the planner that look at the stats, since there are more data values
to examine. I do not have any numbers about this cost though --- do you
want to try to get some?
Given the magnitude of the change (25 data elements versus 10), I
wouldn't expect this to produce a major change in the total runtime of
the optimizer. However, I don't know the optimizer that well, so I'll
do some benchmarks when I get a chance.
-Neil
Neil Conway <neilc@samurai.com> writes:
Tom Lane wrote:
This is something we need to consider, but we'll need more evidence
before making a choice. One thing that we have very little data about
is how much difference it makes in the quality of planner choices.
Right, but is there a practical way to actually get this data?
I haven't thought of one yet, but perhaps someone will have an idea.
Also, I would expect that larger stats targets would slow down the parts
of the planner that look at the stats, since there are more data values
to examine. I do not have any numbers about this cost though --- do you
want to try to get some?
Given the magnitude of the change (25 data elements versus 10), I
wouldn't expect this to produce a major change in the total runtime of
the optimizer.
I wouldn't either, but if we need to raise the stats target to 100 or
1000 to make a meaningful difference, then the question becomes more
urgent.
regards, tom lane
Neil,
In the simple test I performed, raising the default_statistics_target
from 10 to 25 resulted in a 40% increase in the time to ANALYZE a
large table. (I picked 25 more or less at random -- would 15 or 20 be
better?)
I find that very interesting, since I haven't found much larger increases to
carry a proportionate penalty. For example, on an 11-column table raising 3
columns to statistics=250 merely doubled the ANALYZE time. I have not done
exact timing, but would be happy to ....
It may also be the case that for those people for whom 10
is an insufficient stats target, 25 is also insufficient.
It is. I've found that "problem" queries, especially those caused by real,
uneven distribution of data, require raising statistics to 150-400 in order
to fix. This is much too high a level to assign as a default.
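For anyone following along: that kind of targeted fix is applied per column,
with something like the following (table and column names here are made up,
and the target is whatever the particular query needs):

  ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 250;
  ANALYZE orders;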
Any comments on whether increasing the default stats target is a good
idea for 7.5? (Details on the test I performed are included below)
No. I don't think it's a good idea to raise the default for *all* columns;
for one thing, I'd really hate to think what, say, a default stats of 100
would do to a TEXT column with an average of 8K of data per row.
Further, in 7.5 we'll be introducing correlated stats for multi-column indexes
(unless something's gone off with that?) which should help a lot of problem
queries, and change our whole emphasis on brute-forcing ANALYZE by pushing
stats targets into the 100's.
If you really want to tackle this issue, though, here's what I suggest:
1) add a GUC called default_statistics_indexed, which starts at say 100 or 50.
2) Whenever the user indexes a column, automatically increase the stats
to the level in default_statistics_indexed, if they are at the level in
default_statistics_target.
This will then give indexed columns "automatically" a somewhat higher level of
stats analysis than other columns. This should help a lot of "slow query"
problems, yet effectively leave the selection of "important" columns in the
hands of the DBA. Make sense?
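For now the equivalent has to be done by hand: find the columns that back an
index and raise their targets individually. As a rough illustration (the
table name is made up, and the catalog query is only meant as an example of
how to list indexed columns):

  -- which columns of "orders" are used by an index
  SELECT a.attname
  FROM pg_index i
  JOIN pg_attribute a ON a.attrelid = i.indrelid
                     AND a.attnum = ANY (i.indkey)
  WHERE i.indrelid = 'orders'::regclass;

Each of those can then be bumped with ALTER TABLE ... ALTER COLUMN ... SET
STATISTICS, as in the earlier example.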
Also, another great feature in this department would be to extend the
multi-column correlation statistics to cover foreign keys, as a way of
improving cross-table estimates.
Anyway, keep me in the loop on this, I have a lot of very complex databases I
can test such issues on.
--
-Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
It is. I've found that "problem" queries, especially those caused by real,
uneven distribution of data, require raising statistics to 150-400 in order
to fix. This is much too high a level to assign as a default.
That's basically what's bothering me about the suggestion to increase to
25 --- I'm dubious that it will do any good.
Further, in 7.5 we'll be introducing correlated stats for multi-column indexes
(unless something's gone off with that?)
News to me. It's certainly not there now.
This will then give indexed columns "automatically" a somewhat higher
level of stats analysis than other columns.
That is potentially a good idea. There's still the question of what is
a reasonable default, though.
regards, tom lane
Tom Lane wrote:
Josh Berkus <josh@agliodbs.com> writes:
It is. I've found that "problem" queries, especially those caused by real,
uneven distribution of data, require raising statistics to 150-400 in order
to fix. This is much too high a level to assign as a default.
That's basically what's bothering me about the suggestion to increase to
25 --- I'm dubious that it will do any good.
Further, in 7.5 we'll be introducing correlated stats for multi-column indexes
(unless something's gone off with that?)
News to me. It's certainly not there now.
This will then give indexed columns "automatically" a somewhat higher
level of stats analysis than other columns.
That is potentially a good idea. There's still the question of what is
a reasonable default, though.
Do all the columns have to have the same number of statistics buckets?
Could the stats collector adjust the number of buckets based on the
data somehow?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Do all the columns have to have the same number of statistics buckets?
They do not, but the effort spent by ANALYZE is proportional to the
largest stats target among all the columns of the table.
regards, tom lane
On 7 Mar, Tom Lane wrote:
Neil Conway <neilc@samurai.com> writes:
Tom Lane wrote:
This is something we need to consider, but we'll need more evidence
before making a choice. One thing that we have very little data about
is how much difference it makes in the quality of planner choices.
Right, but is there a practical way to actually get this data?
I haven't thought of one yet, but perhaps someone will have an idea.
If DBT-3 is an appropriate test, we can get those EXPLAIN ANALYZE
changes I have working into the STP version and queue up tests varying
the default_statistics_target parameter.
Mark
Tom,
Further, in 7.5 we'll be introducing correlated stats for multi-column
indexes
(unless something's gone off with that?)
This was discussed on Hackers in October; a complete implementation was shown,
and I thought it was committed at that time. If not, what happened to it?
Dammit, it's impossible to find anything in the archives if you don't have
some good keywords or at least the author. Is the author reading this? Will
you speak up?
--
-Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
Further, in 7.5 we'll be introducing correlated stats for multi-column
indexes (unless something's gone off with that?)
This was discussed on Hackers in October; a complete implementation was shown,
and I thought it was committed at that time. If not, what happened to it?
Are you sure you're not thinking of stats for functional indexes?
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Do all the columns have to have the same number of statistics buckets?
They do not, but the effort spent by ANALYZE is proportional to the
largest stats target among all the columns of the table.
Could we use previous stats to determine how many buckets to use when
running ANALYZE? Also, if columns have a different number of buckets,
does that mean that we don't have the same per-query overhead for a
larger stats target?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Tom,
Are you sure you're not thinking of stats for functional indexes?
Positive. I even remember seeing that the patch was accepted.
The patch specifically had to do with a multi-column correlation algorithm for
improving the selectivity of multi-column indexes.
Problem is, with 1400 posts per month August to October, I can't find it, and
the keywords that I think are obvious don't turn anything up.
--
-Josh Berkus
Aglio Database Solutions
San Francisco
I think the thread you're thinking of is on or about this post:
http://archives.postgresql.org/pgsql-patches/2003-12/msg00039.php
Manfred Koizar produced a patch that modified index correlation by
sorting equal key values based on item pointers. The patch went as far
as getting accepted into the patch queue, but Tom raised some doubts
about it and it was subsequently removed.
Robert Treat
On Mon, 2004-03-08 at 14:41, Josh Berkus wrote:
Tom,
Are you sure you're not thinking of stats for functional indexes?
Positive. I even remember seeing that the patch was accepted.
The patch specifically had to do with a multi-column correlation algorithm for
improving the selectivity of multi-column indexes.
Problem is, with 1400 posts per month August to October, I can't find it, and
the keywords that I think are obvious don't turn anything up.
--
-Josh Berkus
Aglio Database Solutions
San Francisco
--
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
Robert Treat <xzilla@users.sourceforge.net> writes:
I think the thread you're thinking of is on or about this post:
http://archives.postgresql.org/pgsql-patches/2003-12/msg00039.php
Manfred Koizar produced a patch that modified index correlation by
sorting equal key values based on item pointers. The patch went as far
as getting accepted into the patch queue, but Tom raised some doubts
about it and it was subsequently removed.
Hm, that had nothing to do with multi-column correlation though.
I'm at a loss to think of any work that matches with Josh's
recollection.
regards, tom lane
Guys,
Hm, that had nothing to do with multi-column correlation though.
I'm at a loss to think of any work that matches with Josh's
recollection.
Hmmmm .... it's possible that early e-mails about Manfred's patch claimed to
improve performance for multi-column indexes.
But it's also possible I'm remembering something else.
Darn it, though! 'cause multi-column correlation is one of our big issues on
estimates for complex queries.
--
-Josh Berkus
Aglio Database Solutions
San Francisco
On Sun, 7 Mar 2004, Tom Lane wrote:
Neil Conway <neilc@samurai.com> writes:
Tom Lane wrote:
This is something we need to consider, but we'll need more evidence
before making a choice. One thing that we have very little data about
is how much difference it makes in the quality of planner choices.
Right, but is there a practical way to actually get this data?
I haven't thought of one yet, but perhaps someone will have an idea.
Hi Tom. I ran some very simple tests on analyze times and query plan
times on a very simple table, with data randomly distributed. The index
was on a date field, since that's what I was testing last.
This was all done on my 512Meg memory 1.1GHz celeron workstation with an
IDE drive. I'd love more input on better testing methodologies here...
The test tables have 100k or 1M rows that look kinda like this (I'll test
10M rows later, which means the dataset won't fit in memory, so there'll be
lots of disk access going on; right now the 1M row table is 80 meg):
select * from test2 limit 5;
info | dt | id
---------------------------------------------+---------------------+---------
Francize perfectible swirling fluctuates | 2004-05-20 20:12:04 | 2721995
Fields chauffeur attentionality grandmother | 2004-04-07 14:36:02 | 2721996
Belgium bilked explosively defendant | 2004-09-16 16:27:22 | 2721997
perspectives Buenos Pollux discriminates | 2004-11-11 12:28:31 | 2721998
Victorianize Savonarola blackmails sufficed | 2004-02-27 21:17:20 | 2721999
(5 rows)
here's what I get with different statistics targets for analyze times:
          100k rows     1M rows      1M rows
 target   analyze ms    analyze ms   plan ms
     10        250          875        2
     20        350         1250
     30        430         1500
     40        520         1725
     50        580         1900
     60        690         2100
     70        775         2175
     80        850         2300
     90        950         2400
    100       1000         2600        2.5
    200       1806         3700
    300       2600         4800
    400       2600         5900
    500       2600         7200
    700       2600         9500
   1000       2600        13000        5
Since this data is randomly distributed, I didn't bother doing a lot of
testing to see how accurate each target setting was. If that would be
useful to know I'd gladly test it, but I was only setting out to test the
time to analyze and the time to plan.
Note that I only tested 3 targets for planning time, as it didn't seem to
make a very big difference. The query was:
select * from test2 where dt between 'march 11, 2004' and 'march 13, 2004';
I also ran some quick tests on smaller tables (1000 and 10k rows) and
there, the plateau that we see in the 100k analyze shows up much quicker,
at something like 50 or so. I.e. the analyze time flattened out quickly
and higher numbers cost very little if anything.
Since this query was quite an easy plan, I'd expect to need a much more
complex one to test the increase in planning time, say something that has
to look at a lot of statistics. Any particular join type or something
that's likely to do that?
"scott.marlowe" <scott.marlowe@ihs.com> writes:
Hi Tom. I ran some very simple tests on analyze times and query plan
times on a very simple table, with data randomly distributed. The index
was on a date field, since that's what I was testing last.
Thanks.
I also ran some quick tests on smaller tables (1000 and 10k rows) and
there, the plateau that we see in the 100k analyze shows up much quicker,
at something like 50 or so. I.e. the analyze time flattened out quickly
and higher numbers cost very little if anything.
The sample size is (IIRC) 300 times stats_target rows, so the "plateau"
that you're seeing occurs when the sample size becomes the entire table.
It would be useful to note how large the ANALYZE process got to be during
these runs.
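If the 300x figure is right, that arithmetic lines up with the numbers above:

  sample size ~ 300 * stats_target
  target  10  ->  ~3,000 sampled rows
  target 100  -> ~30,000 sampled rows
  target 300+ -> ~90,000+ rows, i.e. essentially the whole 100k-row table,
                 which is where the 100k analyze times stop climbing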
Since this query was quite an easy plan, I'd expect to need a much more
complex one to test the increase in planning time, say something that has
to look at a lot of statistics. Any particular join type or something
that's likely to do that?
I'd say try a join on any reasonably plausible foreign-key relationship
(unique key on one side, not-unique data on the other). That's probably
the most common situation. As for making it complicated, just stack up
a bunch of such joins ...
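Something along these lines, for instance (purely illustrative table and
column names, with each join going from a non-unique column to a unique key):

  EXPLAIN SELECT count(*)
  FROM line_items li
  JOIN orders    o ON o.order_id    = li.order_id
  JOIN customers c ON c.customer_id = o.customer_id
  JOIN regions   r ON r.region_id   = c.region_id
  WHERE li.dt BETWEEN '2004-03-11' AND '2004-03-13';

Plain EXPLAIN should be enough here, since the point is to time planning
rather than execution.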
regards, tom lane