benchmarking the query planner (was Re: Simple postgresql.conf wizard)

Started by Robert Haas over 17 years ago · 93 messages · pgsql-hackers
#1Robert Haas
robertmhaas@gmail.com

Sorry for top-posting, but we are getting a bit far afield from the
original topic. I followed up on the tests I did last night:

http://archives.postgresql.org/pgsql-hackers/2008-12/msg00369.php

I benchmarked 100 iterations of EXPLAIN on the query Greg Stark put
together as a synthetic benchmark for default_statistics_target with
various values for "SET STATISTICS n". Testing was done on CVS HEAD
on my laptop with no configure options other than --prefix. Then I
did this, to disable compression on pg_statistic.

alter table pg_statistic alter column stanumbers1 set storage external;
alter table pg_statistic alter column stanumbers2 set storage external;
alter table pg_statistic alter column stanumbers3 set storage external;
alter table pg_statistic alter column stanumbers4 set storage external;
alter table pg_statistic alter column stavalues1 set storage external;
alter table pg_statistic alter column stavalues2 set storage external;
alter table pg_statistic alter column stavalues3 set storage external;
alter table pg_statistic alter column stavalues4 set storage external;

(Note that you'll need to put allow_system_table_mods=true in your
postgresql.conf file if you want this to work.) Then I reran the
tests. The results were pretty dramatic. In the table below, the
first column is the value of "SET STATISTICS n" that was applied to the
table column prior to analyzing it. The second column is the time
required to plan the query 100x AFTER disabling compression on
pg_statistic, and the third column is the time required to plan the
query 100x BEFORE disabling compression on pg_statistic.

10 0.829202 0.8249
20 1.059976 1.06957
30 1.168727 1.143803
40 1.287189 1.263252
50 1.370167 1.363951
60 1.486589 1.460464
70 1.603899 1.571107
80 1.69402 1.689651
90 1.79068 1.804454
100 1.930877 2.803941
150 2.446471 4.833002
200 2.95323 6.217708
250 3.436741 7.507919
300 3.983568 8.895015
350 4.497475 10.201713
400 5.072471 11.576961
450 5.615272 12.933128
500 6.286358 14.408157
550 6.895951 15.745378
600 7.400134 17.192916
650 8.038159 18.568616
700 8.606704 20.025952
750 9.154889 21.45775
800 9.80953 22.74635
850 10.363471 24.057379
900 11.022348 25.559911
950 11.69732 27.021034
1000 12.266699 28.711027

As you can see, for default_statistics_target > 90, this is a HUGE win.

After doing this test, I rebuilt with --enable-profiling and profiled
EXPLAIN 10x with SET STATISTICS 10, 70, 100, 200 with a vanilla
configuration, and then 200 again with compression turned off as
described above. The, ahem, ridiculously small limit on attachment
size prevents me from attaching the full results, so please see the
attached results which are truncated after the first section. 10x
doesn't seem to be quite enough to get the exact picture of where the
bottlenecks are, but the overall picture is clear enough:
decompression introduces a huge overhead.

Looking specifically at the 200-decompress output, the next biggest
hit is AllocSetAlloc(), which, from the detailed results that I
unfortunately can't include, is being called mostly by datumCopy()
which is being called mostly by get_attstatsslot(). There are 4000
calls to get_attstatsslot(), which result in 701,500 calls to datumCopy().

I'm not too sure what any of this means in terms of optimization,
other than that changing the storage type of pg_statistic columns to
external looks like a huge win. Perhaps someone more knowledgeable
than I has some thoughts.

...Robert

Attachments:

gmon-summary.tbz (application/x-bzip-compressed-tar)
#2Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#1)
Re: benchmarking the query planner (was Re: Simple postgresql.conf wizard)

That might only be the case when the pg_statistic record is in shared
buffers.

Also I wonder if eqjoinsel and company might need to be made more
toast-aware by detoasting all the things it needs once rather than
every time it accesses them.

greg

On 6 Dec 2008, at 06:19 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:


#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#2)
Re: benchmarking the query planner (was Re: Simple postgresql.conf wizard)

Greg Stark <greg.stark@enterprisedb.com> writes:

That might only be the case when the pg_statistic record is in shared
buffers.

Yeah, it seems unlikely that disabling compression is a good idea in the
bigger scheme of things.

Also I wonder if eqjoinsel and company might need to be made more
toast-aware by detoasting all the things it needs once rather than
every time it accesses them.

At least in this example, the same stats arrays are fetched multiple
times per query. It might be worth teaching the planner to cache those
results internally during a plan cycle --- this would avoid most of the
complexity associated with caching (ie, figuring out when to clear the
cache) and still get maybe a 2x or 3x or so win.

regards, tom lane

#4Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#3)
Re: benchmarking the query planner (was Re: Simple postgresql.conf wizard)

On Mon, Dec 8, 2008 at 11:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Greg Stark <greg.stark@enterprisedb.com> writes:

That might only be the case when the pg_statistic record is in shared
buffers.

Yeah, it seems unlikely that disabling compression is a good idea in the
bigger scheme of things.

Is there any way to figure out what sort of compression ratios we're
typically getting? The only way compression can really be a win is if
it's saving us having to read in additional pages.

Even then, I'm not sure we should worry about needing to read in
additional pages in this case. I would sure expect statistics to stay
in shared buffers, or at least in the operating system cache. Sure,
if the table is accessed VERY infrequently, that might not be the
case, but is that really what we want to optimize for?

...Robert

#5Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#1)
Re: benchmarking the query planner

On Mon, Dec 8, 2008 at 10:24 AM, Gregory Stark <stark@enterprisedb.com> wrote:

I tried a different query, trying to get quadratic growth and again failed.

The profiling results I sent the other day show an exactly-linear
increase in the number of times eqjoinsel invokes FunctionCall2.
Reading through the eqjoinsel_inner loop in selfuncs.c beginning
around line 2042, I think what is happening is this: since the two
tables are really the same table, nvalues1 and nvalues2 are the same
array, and therefore contain the same elements in the same order. As
a result, for each i, we skip over the first i - 1 entries in
nvalues2, which have already been matched, and then compare element i
of nvalues1 to element i of nvalues2 and mark them both as matched.

Although this is technically an O(n^2) algorithm, the constant is very
low, because this code is Really Fast:

if (hasmatch[j])
continue;

To get observable quadratic behavior, I think you might need to
construct two MCV arrays where all the values are the same, but the
arrays are not in the same order. I believe they are sorted by
frequency, so it might be sufficient to arrange things so that the
N'th most common value in one table is the (statistics_target + 1 -
N)'th most common value in the other. It might also help to use a
type with a really slow comparison function (perhaps one
intentionally constructed to be slow).
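
The behavior described above can be sketched in Python (a hedged illustration with simplified names; the real eqjoinsel_inner in selfuncs.c also accumulates matched frequencies and calls the type's equality operator through FunctionCall2):

```python
# Rough sketch of the MCV-matching loop in eqjoinsel_inner (simplified).
def match_mcvs(values1, values2, eq=lambda a, b: a == b):
    hasmatch2 = [False] * len(values2)
    comparisons = 0
    for v1 in values1:
        for j, v2 in enumerate(values2):
            if hasmatch2[j]:
                continue  # already matched: skipped without calling eq()
            comparisons += 1
            if eq(v1, v2):
                hasmatch2[j] = True
                break
    return comparisons

n = 100
same = list(range(n))
# Identical arrays in identical order: element i matches at position i,
# so only n equality-operator calls are made.
assert match_mcvs(same, same) == n
# Same values in reversed order: close to the worst case, ~n^2/2 calls.
assert match_mcvs(same, list(reversed(same))) == n * (n + 1) // 2
```

This shows why the same-table case stays effectively linear: the `hasmatch` check skips already-matched entries cheaply, and the actual comparison always succeeds on the first unmatched element.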

...Robert

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#5)
Re: benchmarking the query planner

"Robert Haas" <robertmhaas@gmail.com> writes:

On Mon, Dec 8, 2008 at 10:24 AM, Gregory Stark <stark@enterprisedb.com> wrote:

I tried a different query, trying to get quadratic growth and again failed.

The profiling results I sent the other day show an exactly-linear
increase in the number of times eqjoinsel invokes FunctionCall2.
Reading through the eqjoinsel_inner loop in selfuncs.c beginning
around line 2042, I think what is happening is this: since the two
tables are really the same table, nvalues1 and nvalues2 are the same
array, and therefore contain the same elements in the same order.

Yeah, that would be fast. To see a quadratic case you need MCV arrays
that have little or no overlap of common values --- then each element of
the first will be compared (in vain) to all or most of the elements in
the second.

regards, tom lane

#7Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#6)
Re: benchmarking the query planner

Yeah, that would be fast. To see a quadratic case you need MCV arrays
that have little or no overlap of common values --- then each element of
the first will be compared (in vain) to all or most of the elements in
the second.

Ah, that makes sense. Here's a test case based on Greg's. This is
definitely more than linear once you get above about n = 80, but it's
not quadratic either. n = 1000 is only 43x n = 80, and while that's
surely more than 1000/80 = 12.5, it's also a lot less than (1000/80)^2
= 156.25.

create table tk as select
random()::text||random()::text||random()::text||random()::text||random()::text||random()::text
as r from generate_series(1,1000);
insert into tk (select * from tk);
insert into tk (select * from tk);
insert into tk (select
random()::text||random()::text||random()::text||random()::text||random()::text||random()::text
as r from generate_series(1,2000));

create table tk2 as select
random()::text||random()::text||random()::text||random()::text||random()::text||random()::text
as r from generate_series(1,1000);
insert into tk2 (select * from tk2);
insert into tk2 (select * from tk2);
insert into tk2 (select
random()::text||random()::text||random()::text||random()::text||random()::text||random()::text
as r from generate_series(1,2000));

create table tk3 as select
random()::text||random()::text||random()::text||random()::text||random()::text||random()::text
as r from generate_series(1,1000);
insert into tk3 (select * from tk3);
insert into tk3 (select * from tk3);
insert into tk3 (select
random()::text||random()::text||random()::text||random()::text||random()::text||random()::text
as r from generate_series(1,2000));

create table tk4 as select
random()::text||random()::text||random()::text||random()::text||random()::text||random()::text
as r from generate_series(1,1000);
insert into tk4 (select * from tk4);
insert into tk4 (select * from tk4);
insert into tk4 (select
random()::text||random()::text||random()::text||random()::text||random()::text||random()::text
as r from generate_series(1,2000));

create table tk5 as select
random()::text||random()::text||random()::text||random()::text||random()::text||random()::text
as r from generate_series(1,1000);
insert into tk5 (select * from tk5);
insert into tk5 (select * from tk5);
insert into tk5 (select
random()::text||random()::text||random()::text||random()::text||random()::text||random()::text
as r from generate_series(1,2000));

create table tk6 as select
random()::text||random()::text||random()::text||random()::text||random()::text||random()::text
as r from generate_series(1,1000);
insert into tk6 (select * from tk6);
insert into tk6 (select * from tk6);
insert into tk6 (select
random()::text||random()::text||random()::text||random()::text||random()::text||random()::text
as r from generate_series(1,2000));

and then (after disabling autovacuum):

set default_statistics_target = XXX;
analyze;
repeat 100x: explain select count(*) from (select * from tk as k, tk2
as l,tk3 as m,tk4 as n,tk5 as o,tk6 as p where k.r=l.r and k.r=m.r and
k.r=n.r and k.r=o.r and k.r=p.r) as x;

Timings (for 100 iterations):

10 0.900309
20 1.189229
30 1.280892
40 1.447358
50 1.611779
60 1.795701
70 2.001245
80 2.286144
90 2.955732
100 3.925557
150 6.472436
200 9.010824
250 11.89753
300 15.109172
350 18.813514
400 22.901383
450 27.842019
500 32.02136
550 37.609196
600 42.894322
650 48.460327
700 55.169819
750 61.568125
800 68.222201
850 75.027591
900 82.918344
950 91.235267
1000 99.737802

...Robert

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#7)
Re: benchmarking the query planner

"Robert Haas" <robertmhaas@gmail.com> writes:

Ah, that makes sense. Here's a test case based on Greg's. This is
definitely more than linear once you get above about n = 80, but it's
not quadratic either. n = 1000 is only 43x n = 80, and while that's
surely more than 1000/80 = 12.5, it's also a lot less than (1000/80)^2
= 156.25.

Yeah, that's not too surprising. There's a linear cost associated with
fetching/deconstructing the stats array, and a quadratic cost in the
comparison loop. What these results say is that the upper limit of 1000
keeps you from getting into the range where the linear component is
completely negligible. If you plot the results and try to fit a curve
like c1*x^2 + c2*x to them, you get a very good fit for all the points
above about 80. Below that, the curve has a flatter slope, indicating
a smaller linear cost component. The values in this test are about 100
bytes each, so the breakpoint at 80 seems to correspond to what happens
when the stats array no longer fits in one page.
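
The c1*x^2 + c2*x model can be checked against the timings below with a quick pure-Python least-squares fit (normal equations, no numpy), restricted to the points above about 80 where the fit is reported to be good. The variable names c1 and c2 just follow the description above:

```python
# Text-column timings from this message: (stats target, seconds per 100 plans).
data = [(10, 1.587), (20, 1.997), (30, 2.208), (40, 2.499), (50, 2.734),
        (60, 3.048), (70, 3.368), (80, 3.706), (90, 4.686), (100, 6.418),
        (150, 10.016), (200, 13.598), (250, 17.905), (300, 22.777),
        (400, 33.471), (500, 46.394), (600, 61.185), (700, 77.664),
        (800, 96.304), (900, 116.615), (1000, 140.117)]

pts = [(x, y) for x, y in data if x >= 80]  # region where the model fits well

# Least squares for y = c1*x^2 + c2*x via the 2x2 normal equations.
S4 = sum(x**4 for x, _ in pts)
S3 = sum(x**3 for x, _ in pts)
S2 = sum(x**2 for x, _ in pts)
b1 = sum(x * x * y for x, y in pts)
b2 = sum(x * y for x, y in pts)
det = S4 * S2 - S3 * S3
c1 = (b1 * S2 - S3 * b2) / det
c2 = (S4 * b2 - S3 * b1) / det

# A genuine quadratic component (the comparison loop) plus a nontrivial
# linear component (fetching/deconstructing the stats arrays).
assert c1 > 0 and c2 > 0
```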

I replicated your test and got timings like these, using CVS HEAD with
cassert off:

10 1.587
20 1.997
30 2.208
40 2.499
50 2.734
60 3.048
70 3.368
80 3.706
90 4.686
100 6.418
150 10.016
200 13.598
250 17.905
300 22.777
400 33.471
500 46.394
600 61.185
700 77.664
800 96.304
900 116.615
1000 140.117

So this machine is a bit slower than yours, but otherwise it's pretty
consistent with your numbers. I then modified the test to use

array[random()::text,random()::text,random()::text,random()::text,random()::text,random()::text]

ie, the same data except stuffed into an array. I did this because
I know that array_eq is pretty durn expensive, and indeed:

10 1.662
20 2.478
30 3.119
40 3.885
50 4.636
60 5.437
70 6.403
80 7.427
90 8.473
100 9.597
150 16.542
200 24.919
250 35.225
300 47.423
400 76.509
500 114.076
600 157.535
700 211.189
800 269.731
900 335.427
1000 409.638

When looking at these numbers one might think the threshold of pain
is about 50, rather than 100 which is where I'd put it for the text
example. However, this is probably an extreme worst case.

On the whole I think we have some evidence here to say that upping the
default value of default_stats_target to 100 wouldn't be out of line,
but 1000 definitely is. Comments?

BTW, does anyone have an opinion about changing the upper limit for
default_stats_target to, say, 10000? These tests suggest that you
wouldn't want such a value for a column used as a join key, but
I can see a possible argument for high values in text search and
similar applications.

regards, tom lane

#9Vladimir Sitnikov
sitnikov.vladimir@gmail.com
In reply to: Tom Lane (#8)
Re: benchmarking the query planner

BTW, does anyone have an opinion about changing the upper limit for
default_stats_target to, say, 10000? These tests suggest that you
wouldn't want such a value for a column used as a join key, but
I can see a possible argument for high values in text search and
similar applications.

Do you consider using hash tables?
I am not sure a hash is a perfect match here, but I guess some kind of
lookup structure might improve the O(N^2) behaviour. It looks like that
would improve both array_eq (by narrowing the candidate arrays to a
single hash bucket) and large statistics targets (where I would guess it
improves O(N^2) to O(N)).
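
A hedged sketch of the idea (function name is illustrative, not anything in PostgreSQL): match two MCV lists through a dict instead of the nested loop. Note this assumes the type has a hashable representation consistent with its '=' operator, an assumption the planner's generic operator-based comparison does not get to make.

```python
# O(N) MCV matching via a hash table instead of the O(N^2) nested loop.
def match_mcvs_hashed(values1, values2):
    pos2 = {v: j for j, v in enumerate(values2)}  # MCV values are distinct
    matched1, matched2 = [], set()
    for i, v in enumerate(values1):
        j = pos2.get(v)
        if j is not None:
            matched1.append(i)   # index matched in list 1
            matched2.add(j)      # index matched in list 2
    return matched1, matched2

m1, m2 = match_mcvs_hashed(['a', 'b', 'c'], ['c', 'x', 'a'])
assert m1 == [0, 2] and m2 == {0, 2}
```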

Regards,
Vladimir Sitnikov

#10Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#8)
Re: benchmarking the query planner

When looking at these numbers one might think the threshold of pain
is about 50, rather than 100 which is where I'd put it for the text
example. However, this is probably an extreme worst case.

On the whole I think we have some evidence here to say that upping the
default value of default_stats_target to 100 wouldn't be out of line,
but 1000 definitely is. Comments?

Do you think there's any value in making it scale based on the size of
the table? How hard would it be? If you made it MIN(10 + 0.001 *
estimated_rows, 100), you would probably get most of the benefit while
avoiding unnecessary overhead for small tables.
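
The proposed heuristic, as a sketch (this formula is only the suggestion made here, not anything PostgreSQL implements):

```python
# Hypothetical table-size-scaled statistics target: MIN(10 + 0.001 * rows, 100).
def scaled_stats_target(estimated_rows, floor=10, ceiling=100):
    return min(floor + int(0.001 * estimated_rows), ceiling)

assert scaled_stats_target(1_000) == 11        # small tables stay near the floor
assert scaled_stats_target(50_000) == 60
assert scaled_stats_target(1_000_000) == 100   # large tables hit the ceiling
```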

Otherwise, I am a bit concerned that 10 -> 100 may be too big a jump
for one release, especially since it may cause the statistics to get
toasted in some cases, which comes with a significant performance hit.
I would raise it to 30 or 50 and plan to consider raising it further
down the road. (I realize I just made about a million enemies with
that suggestion.)

BTW, does anyone have an opinion about changing the upper limit for
default_stats_target to, say, 10000? These tests suggest that you
wouldn't want such a value for a column used as a join key, but
I can see a possible argument for high values in text search and
similar applications.

I think that's a good idea. Given that most people probably don't
bother fiddling with this parameter at all, it doesn't strike me as much
of a foot-gun. I think you'd need a heck of a big table to justify a
value in that range, but some people may have them.

...Robert

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Vladimir Sitnikov (#9)
Re: benchmarking the query planner

"Vladimir Sitnikov" <sitnikov.vladimir@gmail.com> writes:

Do you consider using hash tables?

Doubt it's really worth it, unless there's some way to amortize the
setup cost across multiple selectivity estimations; which would surely
complicate life.

One thing that just now occurred to me is that as long as we maintain
the convention that MCV lists are in decreasing frequency order, one can
take any prefix of the list and it's a perfectly good MCV list of less
resolution. So one way to reduce the time taken in eqjoinsel is to set
an upper limit on the number of entries considered *by that routine*,
whereas other estimator functions could use larger lists.
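
A sketch of that observation (helper name is illustrative): because the list is kept in decreasing frequency order, truncation loses only the least-common entries, so an expensive estimator can clamp how many entries it considers while other estimators use the full list.

```python
# Any prefix of a frequency-ordered MCV list is itself a valid,
# lower-resolution MCV list.
def mcv_prefix(mcv, max_entries):
    # mcv: list of (value, frequency), decreasing frequency order
    assert all(mcv[i][1] >= mcv[i + 1][1] for i in range(len(mcv) - 1))
    return mcv[:max_entries]

full = [('a', 0.30), ('b', 0.20), ('c', 0.10), ('d', 0.05)]
assert mcv_prefix(full, 2) == [('a', 0.30), ('b', 0.20)]
```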

regards, tom lane

#12Vladimir Sitnikov
sitnikov.vladimir@gmail.com
In reply to: Tom Lane (#11)
Re: benchmarking the query planner

Do you consider using hash tables?

Doubt it's really worth it, unless there's some way to amortize the
setup cost across multiple selectivity estimations; which would surely
complicate life.

MCV lists are updated only during the "analyze" phase, aren't they? If
the "setup cost" is the cost of maintaining those "hash tables", it is
not going to hurt much.

One thing that just now occurred to me is that as long as we maintain
the convention that MCV lists are in decreasing frequency order, one can
take any prefix of the list and it's a perfectly good MCV list of less
resolution. So one way to reduce the time taken in eqjoinsel is to set
an upper limit on the number of entries considered *by that routine*,
whereas other estimator functions could use larger lists.

That makes sense; however, a linear search for a single item in a list
of 10,000 elements could take a while. A hash lookup might be a better
choice.

Regards,
Vladimir Sitnikov

#13Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#11)
Re: benchmarking the query planner

On Thu, Dec 11, 2008 at 2:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Vladimir Sitnikov" <sitnikov.vladimir@gmail.com> writes:

Do you consider using hash tables?

Doubt it's really worth it, unless there's some way to amortize the
setup cost across multiple selectivity estimations; which would surely
complicate life.

One thing that just now occurred to me is that as long as we maintain
the convention that MCV lists are in decreasing frequency order, one can
take any prefix of the list and it's a perfectly good MCV list of less
resolution. So one way to reduce the time taken in eqjoinsel is to set
an upper limit on the number of entries considered *by that routine*,
whereas other estimator functions could use larger lists.

To what extent will that negate the benefit of having those statistics
in the first place?

Here's another idea. If you have a < operator, you could use a
quicksort-type strategy to partition the search space. Pick an
arbitrary element of either list and apply it to all elements of both
lists to divide the initial problem into two problems that are each
half as large. When the subproblems fall below some size threshold,
then solve them according to the existing algorithm. This is O(n^2)
in the worst case, just like quicksort, but the worst case is
difficult to construct.
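
The strategy described above can be sketched as follows (names and the size threshold are illustrative; this also assumes a '<' operator consistent with '=', and distinct MCV values):

```python
# Quicksort-style partitioning of two MCV lists: pick a pivot, split
# both lists around it, recurse, and fall back to the nested loop
# below a size threshold.
def count_matches(xs, ys, threshold=4):
    if len(xs) <= threshold or len(ys) <= threshold:
        return sum(1 for x in xs if x in ys)  # brute-force base case
    pivot = xs[len(xs) // 2]
    xlo = [x for x in xs if x < pivot]
    xhi = [x for x in xs if x > pivot]
    ylo = [y for y in ys if y < pivot]
    yhi = [y for y in ys if y > pivot]
    hit = 1 if pivot in ys else 0
    return hit + count_matches(xlo, ylo) + count_matches(xhi, yhi)

assert count_matches([1, 3, 5, 7, 9, 11], [2, 3, 5, 8, 9, 10]) == 3
```

Like quicksort, each level does linear work and a bad pivot degrades toward O(n^2), but a typical split halves both subproblems.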

...Robert

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#10)
Re: benchmarking the query planner

"Robert Haas" <robertmhaas@gmail.com> writes:

On the whole I think we have some evidence here to say that upping the
default value of default_stats_target to 100 wouldn't be out of line,
but 1000 definitely is. Comments?

Do you think there's any value in making it scale based on the size of
the table?

As far as the MCVs go, I think we already have a decent heuristic for
determining the actual length of the array, based on discarding values
that have too small an estimated frequency --- look into
compute_scalar_stats. I don't see that explicitly considering table
size would improve that. It might be worth limiting the size of the
histogram, as opposed to the MCV list, for smaller tables. But that's
not relevant to the speed of eqjoinsel, and AFAIK there aren't any
estimators that are O(N^2) in the histogram length.
(ineq_histogram_selectivity is actually O(log N), so it hardly cares at
all.)
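
The reason the histogram path is cheap can be sketched like this (a simplified illustration: real ineq_histogram_selectivity also interpolates within the located bin, which this version skips):

```python
import bisect

# With sorted histogram bin bounds, locating a constant is a binary
# search, so the cost is O(log N) in the histogram length.
def lt_selectivity(hist_bounds, const):
    # hist_bounds: sorted bin boundaries; bins assumed equi-populated
    nbins = len(hist_bounds) - 1
    i = bisect.bisect_left(hist_bounds, const)
    i = max(0, min(i, nbins))
    return i / nbins

bounds = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # 10 bins
assert lt_selectivity(bounds, 50) == 0.5
assert lt_selectivity(bounds, 0) == 0.0
assert lt_selectivity(bounds, 100) == 1.0
```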

Otherwise, I am a bit concerned that 10 -> 100 may be too big a jump
for one release, especially since it may cause the statistics to get
toasted in some cases, which comes with a significant performance hit.
I would raise it to 30 or 50 and plan to consider raising it further
down the road. (I realize I just made about a million enemies with
that suggestion.)

There's something in what you say, but consider that we have pretty
much unanimous agreement that 10 is too small. I think we should
try to fix the problem, not just gradually ratchet up the value until
people start complaining in the other direction. (Also, we should have
plenty of opportunity during beta to find out if we went too far.)

regards, tom lane

#15Vladimir Sitnikov
sitnikov.vladimir@gmail.com
In reply to: Tom Lane (#14)
Re: benchmarking the query planner

There's something in what you say, but consider that we have pretty
much unanimous agreement that 10 is too small. I think we should
try to fix the problem, not just gradually ratchet up the value until
people start complaining in the other direction. (Also, we should have
plenty of opportunity during beta to find out if we went too far.)

I am not sure whether the entity-attribute-value model is common in
Postgres databases; however, that is one of the cases that requires a
large MCV list (generally, for the attribute column).

You know, Oracle is not able to store more than 254 distinct values for
histogram statistics. That really limits the use of histograms for the
software product the company I work for creates.

One more direction could be implementing an "MCV" for ranges of values
(group values and interpolate in between). Consider statistics on a
timestamp column saying that 2008-December has X rows, 2008-November has
Y rows, etc. That could be used for rather accurate cardinality
estimation in BETWEEN cases, while keeping the number of entries in the
"MCV" list small.
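
The grouped-range idea might look like this sketch (purely illustrative, not an existing PostgreSQL structure; dates and counts are made up, and rows are assumed uniform within each month):

```python
from datetime import date

# Per-month row counts, standing in for the grouped "MCV" entries.
month_counts = {date(2008, 11, 1): 3000,   # "2008-November": Y rows
                date(2008, 12, 1): 6000}   # "2008-December": X rows

def rows_between(start, end):
    """Estimate rows with start <= ts < end by linear interpolation."""
    total = 0.0
    for month_start, count in month_counts.items():
        nxt = date(month_start.year + month_start.month // 12,
                   month_start.month % 12 + 1, 1)  # first day of next month
        days = (nxt - month_start).days
        lo = max(start, month_start)
        hi = min(end, nxt)
        overlap = max(0, (hi - lo).days)
        total += count * overlap / days  # uniform-within-month assumption
    return total

# Half of November plus all of December:
assert rows_between(date(2008, 11, 16), date(2009, 1, 1)) == 1500 + 6000
```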

Regards,
Vladimir Sitnikov

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Vladimir Sitnikov (#15)
Re: benchmarking the query planner

"Vladimir Sitnikov" <sitnikov.vladimir@gmail.com> writes:

One more direction could be implementing "MCV" for range of values (group
values and interpolate in between). Consider statistics on timestamp column
that says that for "2008-December" there are as many X rows, for
"2008-November" as many as Y, etc. That could be used for rather accurate
cardinality estimation of "between" cases, while keeping number of entries
in "MCV" list small.

I think that would likely be redundant with the histogram.

regards, tom lane

#17Nathan Boley
npboley@gmail.com
In reply to: Vladimir Sitnikov (#15)
Re: benchmarking the query planner

One more direction could be implementing "MCV" for range of values (group
values and interpolate in between). Consider statistics on timestamp column
that says that for "2008-December" there are as many X rows, for
"2008-November" as many as Y, etc. That could be used for rather accurate
cardinality estimation of "between" cases, while keeping number of entries
in "MCV" list small.

I suggested this earlier (
http://archives.postgresql.org/pgsql-hackers/2008-06/msg00353.php ).
If there's interest, I'd be happy to clean up my patch and submit it
for 8.5.

-Nathan

#18Vladimir Sitnikov
sitnikov.vladimir@gmail.com
In reply to: Tom Lane (#16)
Re: benchmarking the query planner

I think that would likely be redundant with the histogram.

I am afraid I do not get you. I mean that histograms should be
considered when it comes to increasing the number of MCV entries (at
least for numeric/timestamp values). With a histogram, a lower number of
entries could be sufficient to get reasonable accuracy of estimation. I
have no idea how to decide on an automatic switch between histograms and
MCV. It might sound crazy, but one could compute both histograms and
MCV, use them both, and pick the average of the two estimations :)
Regards,
Vladimir Sitnikov

#19Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#8)
Re: benchmarking the query planner

On Thu, 2008-12-11 at 13:09 -0500, Tom Lane wrote:

On the whole I think we have some evidence here to say that upping the
default value of default_stats_target to 100 wouldn't be out of line,
but 1000 definitely is. Comments?

Sounds good to me.

I would like it even more if there were a data-type-specific default.
Currently we have a special case for boolean, but that's it. TPC allows
for different defaults for different types (but that's not my only
reason). That would allow us to set it lower for long text strings and
floats and potentially higher for things like smallint, which is less
likely as a join target.

Would be great if we could set the default_stats_target for all columns
in a table with a single ALTER TABLE statement, rather than setting it
for every column individually.

And I would like it even more if the sample size increased according to
table size, since that makes ndistinct values fairly random for large
tables.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Vladimir Sitnikov (#18)
Re: benchmarking the query planner

"Vladimir Sitnikov" <sitnikov.vladimir@gmail.com> writes:

I think that would likely be redundant with the histogram.

I am afraid I do not get you.

AFAICS what you're proposing *is* a histogram, just awkwardly described.
What is the specific difference between what you are talking about and
what scalarineqsel already implements?

regards, tom lane

#21Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#19)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#21)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#19)
#24Vladimir Sitnikov
sitnikov.vladimir@gmail.com
In reply to: Tom Lane (#20)
#25Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#8)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#25)
#27Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#23)
#28Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#23)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#27)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#28)
#31Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#21)
#32Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Simon Riggs (#28)
#33Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#14)
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#31)
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#33)
#36Nathan Boley
npboley@gmail.com
In reply to: Vladimir Sitnikov (#24)
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Boley (#36)
#38Nathan Boley
npboley@gmail.com
In reply to: Tom Lane (#37)
#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Boley (#38)
#40Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#26)
#41Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#40)
#42Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#31)
#43Nathan Boley
npboley@gmail.com
In reply to: Tom Lane (#39)
#44Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#41)
#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#44)
#46Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#42)
#47Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#34)
#48Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#46)
#49Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#45)
#50Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#48)
#51Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#50)
#52Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#49)
#53Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#48)
#54Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#50)
#55Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#51)
#56Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#53)
#57Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#56)
#58Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Nathan Boley (#43)
#59Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#58)
#60Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#56)
#61Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#57)
#62Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#57)
#63Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#59)
#64Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#57)
#65Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#60)
#66Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#56)
#67Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#59)
#68Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#65)
#69Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#68)
#70Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#64)
#71Simon Riggs
simon@2ndQuadrant.com
In reply to: Alvaro Herrera (#64)
#72Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#62)
#73Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#63)
#74Bruce Momjian
bruce@momjian.us
In reply to: Alvaro Herrera (#63)
#75Bruce Momjian
bruce@momjian.us
In reply to: Alvaro Herrera (#64)
#76Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#69)
#77Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#60)
#78Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#61)
#79Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#72)
#80Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#77)
#81Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#77)
#82Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#80)
#83Euler Taveira de Oliveira
In reply to: Robert Haas (#59)
#84Robert Haas
robertmhaas@gmail.com
In reply to: Euler Taveira de Oliveira (#83)
#85Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#80)
#86Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#82)
#87Ron Mayer
rm_pg@cheapcomplexdevices.com
In reply to: Bruce Momjian (#65)
#88Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#78)
#89Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#88)
#90Greg Smith
gsmith@gregsmith.com
In reply to: Tom Lane (#8)
#91Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#89)
#92ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Robert Haas (#91)
#93Robert Haas
robertmhaas@gmail.com
In reply to: ITAGAKI Takahiro (#92)