Experimental evaluation of PostgreSQL's query optimizer

Started by Viktor Leis · about 10 years ago · 16 messages
#1Viktor Leis
leis@in.tum.de

Hi,

We have recently performed an experimental evaluation of PostgreSQL's
query optimizer. For example, we measured the contributions of
cardinality estimation and the cost model on the overall query
performance. You can download the resulting paper here:
http://www.vldb.org/pvldb/vol9/p204-leis.pdf

Some findings:
1. Perhaps unsurprisingly, we found that cardinality
estimation is the biggest problem in query optimization.
2. The quality of Postgres' cardinality estimates is not generally worse
than that of the major commercial systems.
3. It seems to me that one obvious way to avoid many bad situations
would be to disable nested loop joins when the inner relation is NOT
an index scan.

I hope this will be of interest to some of you.

--

Viktor Leis


#2Simon Riggs
simon@2ndQuadrant.com
In reply to: Viktor Leis (#1)
Re: Experimental evaluation of PostgreSQL's query optimizer

On 16 December 2015 at 09:51, Viktor Leis <leis@in.tum.de> wrote:

Hi,

We have recently performed an experimental evaluation of PostgreSQL's
query optimizer. For example, we measured the contributions of
cardinality estimation and the cost model on the overall query
performance. You can download the resulting paper here:
http://www.vldb.org/pvldb/vol9/p204-leis.pdf

Thank you, an excellent contribution.

Some findings:
1. Perhaps unsurprisingly, we found that cardinality
estimation is the biggest problem in query optimization.
2. The quality of Postgres' cardinality estimates is not generally worse
than that of the major commercial systems.

Good to hear, though we have found problems there that alter plan selection
adversely for TPC-H with the current optimizer. We have some improvements
which may be in the next release.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3Albe Laurenz
laurenz.albe@wien.gv.at
In reply to: Viktor Leis (#1)
Re: Experimental evaluation of PostgreSQL's query optimizer

Viktor Leis wrote:

We have recently performed an experimental evaluation of PostgreSQL's
query optimizer. For example, we measured the contributions of
cardinality estimation and the cost model on the overall query
performance. You can download the resulting paper here:
http://www.vldb.org/pvldb/vol9/p204-leis.pdf

Some findings:
1. Perhaps unsurprisingly, we found that cardinality
estimation is the biggest problem in query optimization.
2. The quality of Postgres' cardinality estimates is not generally worse
than that of the major commercial systems.
3. It seems to me that one obvious way to avoid many bad situations
would be to disable nested loop joins when the inner relation is NOT
an index scan.

I hope this will be of interest to some of you.

I have read the paper with great interest, and I have some comments.

- The paper mentions that the "Join Order Benchmark" has high cross-table
correlation, and this correlation is responsible for bad cardinality
estimates that cause bad plans with all RDBMS.
Wouldn't it be interesting to do the same experiment with different
real-world data sets to see if that is indeed typical and not an
idiosyncrasy of that specific benchmark?

- The paper suggests that sampling the base tables is preferable to
using statistics because it gives better estimates, but I think that
is only a win for long-running, complicated, data-warehouse-style queries.
For the typical OLTP query it would incur intolerable planning times.
Any ideas on that?

- From my experience in tuning SQL queries I can confirm your one finding,
namely that bad cardinality estimates are the prime source for bad
plan choices.
Perhaps it would be valuable to start thinking about statistics for
inter-table correlation. What about something as "simple" as a factor
per (joinable) attribute pair that approximates the total row count
of a join on these attributes, divided by the planner's estimate?

- I also can corroborate your finding that nested loop joins are often
harmful, particularly when the inner loop is a sequential scan.
One of the first things I do when investigating bad performance of a query
whose plan has a nested loop join is to set enable_nestloop to "off"
and see if that makes a difference, and it often does (a minimal sketch
of that experiment follows after this list).
Maybe it would be a win to bias the planner against nested loop joins.
This is dreaming, but it might be nice to have some number as to how
reliable a certain estimate is, which is high if the estimate is, say,
derived from a single filter on a base table and sinks as more conditions
are involved or numbers pulled out of thin air.
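
To make that concrete, here is a minimal sketch of the experiment I mean.
The schema and query are made up; EXPLAIN ANALYZE, SET and RESET of
enable_nestloop are standard PostgreSQL, and the setting is only changed
for the current session:

-- hypothetical query that currently runs as a nested loop with an inner seq scan
explain analyze
select count(*) from orders o join customers c on o.customer_id = c.id
where c.country = 'AT';

-- disable nested loops for this session only, then compare plan and runtime
set enable_nestloop to off;
explain analyze
select count(*) from orders o join customers c on o.customer_id = c.id
where c.country = 'AT';

reset enable_nestloop;  -- restore the default afterwards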

Yours,
Laurenz Albe


#4Viktor Leis
leis@in.tum.de
In reply to: Albe Laurenz (#3)
Re: Experimental evaluation of PostgreSQL's query optimizer

On 21.12.2015 at 09:22, Albe Laurenz wrote:

Viktor Leis wrote:

We have recently performed an experimental evaluation of PostgreSQL's
query optimizer. For example, we measured the contributions of
cardinality estimation and the cost model on the overall query
performance. You can download the resulting paper here:
http://www.vldb.org/pvldb/vol9/p204-leis.pdf

Some findings:
1. Perhaps unsurprisingly, we found that cardinality
estimation is the biggest problem in query optimization.
2. The quality of Postgres' cardinality estimates is not generally worse
than that of the major commercial systems.
3. It seems to me that one obvious way to avoid many bad situations
would be to disable nested loop joins when the inner relation is NOT
an index scan.

I hope this will be of interest to some of you.

I have read the paper with great interest, and I have some comments.

- The paper mentions that the "Join Order Benchmark" has high cross-table
correlation, and this correlation is responsible for bad cardinality
estimates that cause bad plans with all RDBMS.
Wouldn't it be interesting to do the same experiment with different
real-world data sets to see if that is indeed typical and not an
idiosyncrasy of that specific benchmark?

The IMDB data set certainly has lots of correlations in comparison
with synthetic benchmarks like TPC-H.

I do not claim that IMDB is representative (whatever that means). But
it is certainly a data set that people put into a database and run
join queries on. It would indeed be very interesting to do
similar experiments with other real-world data sets. We had to come up
with our own queries because you seldom find combinations of public
relational data sets with non-trivial OLAP queries.

- The paper suggests that sampling the base tables is preferable to
using statistics because it gives better estimates, but I think that that
is only a win with long running, complicated, data warehouse style queries.
For the typical OLTP query it would incur intolerable planning times.
Any ideas on that?

I agree that sampling is not suitable for most OLTP queries. (One hack
would be to run the optimizer without sampling, check whether the
estimated cost is high, and if so reoptimize with sampling.)

In data warehouse settings, I've seen suggestions to increase
default_statistics_target to a large value, which in my experience
results in extremely large planning times. Sampling could be a more
precise alternative.
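
For reference, these are the standard knobs I am referring to (the
specific value and the table/column names are only illustrative):

-- globally (postgresql.conf) or per session; the default is 100
set default_statistics_target = 1000;

-- or only for a particularly skewed column, followed by a fresh analyze
alter table lineitem alter column l_shipdate set statistics 1000;
analyze lineitem;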

- From my experience in tuning SQL queries I can confirm your one finding,
namely that bad cardinality estimates are the prime source for bad
plan choices.
Perhaps it would be valuable to start thinking about statistics for
inter-table correlation. What about something as "simple" as a factor
per (joinable) attribute pair that approximates the total row count
of a join on these attributes, divided by the planner's estimate?

I think your suggestion amounts to caching the cardinalities of all
two-way joins. One major issue is that for a query like

select * from r1, r2 where r1.x = r2.y and r1.a = ? and r2.b = ?;

it depends on the specific values of r1.a and r2.b whether there is
any (anti-)correlation. And obviously one cannot store correction
factors for each value of a and b.
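
Here is a contrived sketch (throwaway tables, made-up data) of what I
mean: the same join predicate produces a huge result for one pair of
constants and an empty result for another, so no single correction
factor per attribute pair can describe both cases:

drop table if exists r1, r2;
create table r1 (x int, a int);
create table r2 (y int, b int);
-- x and y have the same parity as a and b, respectively
insert into r1 select g % 100, g % 2 from generate_series(1,10000) g;
insert into r2 select g % 100, g % 2 from generate_series(1,10000) g;
analyze r1;
analyze r2;

-- a = 0 and b = 0: every filtered r1 row finds ~100 partners (500000 rows)
select count(*) from r1, r2 where r1.x = r2.y and r1.a = 0 and r2.b = 0;
-- a = 0 and b = 1: the surviving x and y values never overlap (0 rows)
select count(*) from r1, r2 where r1.x = r2.y and r1.a = 0 and r2.b = 1;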

- I also can corroborate your finding that nested loop joins are often
harmful, particularly when the inner loop is a sequential scan.
One of the first things I do when investigating bad performance of a query
whose plan has a nested loop join is to set enable_nestloop to "off"
and see if that makes a difference, and it often does.
Maybe it would be a win to bias the planner against nested loop joins.
This is dreaming, but it might be nice to have some number as to how
reliable a certain estimate is, which is high if the estimate is, say,
derived from a single filter on a base table and sinks as more conditions
are involved or numbers pulled out of thin air.

I think it would be a good start to distinguish between nested loop
joins with and without an index. In my opinion, the latter should
simply NEVER be chosen.

--
Viktor Leis


#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Albe Laurenz (#3)
Re: Experimental evaluation of PostgreSQL's query optimizer

Albe Laurenz <laurenz.albe@wien.gv.at> writes:

- I also can corroborate your finding that nested loop joins are often
harmful, particularly when the inner loop is a sequential scan.
One of the first things I do when investigating bad performance of a query
whose plan has a nested loop join is to set enable_nestloop to "off"
and see if that makes a difference, and it often does.
Maybe it would be a win to bias the planner against nested loop joins.
This is dreaming,

Yeah, it sure is, because there's another half of the world that complains
bitterly any time they *don't* get a nested-loop plan; and for their
queries and data, they're right. I'd be the first to agree that it would
be good to make the planner smarter, but simplistic solutions like "bias
it against (or for) nested loops" aren't likely to improve matters.

but it might be nice to have some number as to how
reliable a certain estimate is, which is high if the estimate is, say,
derived from a single filter on a base table and sinks as more conditions
are involved or numbers pulled out of thin air.

That might indeed be a useful thing to try to do, though I confess I'm
not quite sure what we'd do with such numbers if we had them. It seems
like the next thing would be to replace single cost estimates for plans
with cost ranges, but then how do you compare a couple of overlapping
ranges?

regards, tom lane


#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Viktor Leis (#4)
Re: Experimental evaluation of PostgreSQL's query optimizer

Viktor Leis <leis@in.tum.de> writes:

I think it would be a good start to distinguish between nested loop
joins with and without an index.

We do.

In my opinion, the latter should simply NEVER be chosen.

So, if you're given a query with a non-equality join condition
that doesn't match any index on either table, you think the planner
should just fail? This is not realistic. Certainly nestloop with
inner seqscan is likely to be slow, but that's already reflected
in the cost estimates.

regards, tom lane


#7Viktor Leis
leis@in.tum.de
In reply to: Tom Lane (#6)
Re: Experimental evaluation of PostgreSQL's query optimizer

On 21.12.2015 at 15:42, Tom Lane wrote:

Viktor Leis <leis@in.tum.de> writes:

I think it would be a good start to distinguish between nested loop
joins with and without an index.

We do.

In my opinion, the latter should simply NEVER be chosen.

So, if you're given a query with a non-equality join condition
that doesn't match any index on either table, you think the planner
should just fail? This is not realistic. Certainly nestloop with
inner seqscan is likely to be slow, but that's already reflected
in the cost estimates.

You are right that for non-equality joins there is no alternative, so
nested loop is obviously the right choice. However, that does not make
a nested loop join a good choice in cases where a hash join would be
possible.

Please have a look at Figure 6 (page 6) in
http://www.vldb.org/pvldb/vol9/p204-leis.pdf Disabling nested loop
joins without index scan (going from (a) to (b)) results in great
improvements across the board. And even more importantly, it avoids
most of the cases where queries took unreasonably long and timed
out. Basically this amounts to being able to run the query on
PostgreSQL at all.

The cost model does not save you because the estimated cardinality is
close to 1. Also note that a hash join is fast too if the estimate is
correct. Picking a nested loop join in these situations is very risky,
and there is practically no upside to this decision (a sketch of how
such a misestimate shows up in EXPLAIN ANALYZE follows below).
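
A hypothetical query in the spirit of the JOB queries (table names from
the IMDB data set; the exact predicates are made up):

explain analyze
select *
from title t, movie_info mi
where t.id = mi.movie_id
  and t.production_year > 2000
  and mi.info like '%Comedy%';

-- If the Nested Loop node shows an estimate like rows=1 while the actual
-- outer row count is in the tens of thousands, the inner seq scan is
-- repeated that many times; a hash join would scan the inner side once.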

--
Viktor Leis


#8Craig Ringer
craig@2ndquadrant.com
In reply to: Viktor Leis (#7)
Re: Experimental evaluation of PostgreSQL's query optimizer

On 21 December 2015 at 23:57, Viktor Leis <leis@in.tum.de> wrote:

Please have a look at Figure 6 (page 6) in
http://www.vldb.org/pvldb/vol9/p204-leis.pdf Disabling nested loop
joins without index scan (going from (a) to (b)) results in great
improvements across the board. And even more importantly, it avoids
most of the cases where queries took unreasonably long and timed
out. Basically this amounts to being able to run the query on
PostgreSQL at all.

For that data, yes. But you're ignoring other important cases. Small or
even 1-element lookup tables can be one where a nestloop over a seqscan
turns out to be by far the fastest way to do the job. This can really add
up if it's deep in a subplan that's executed repeatedly, or if it's part of
queries that get run very frequently on a busy OLTP system.

That said, these cases are also the ones that end up hurting very badly if
the stats are inaccurate or outdated and our expected 3 loops turn into
3000.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#9Craig Ringer
craig@2ndquadrant.com
In reply to: Viktor Leis (#4)
Re: Experimental evaluation of PostgreSQL's query optimizer

On 21 December 2015 at 20:53, Viktor Leis <leis@in.tum.de> wrote:

I think your suggestion amounts to caching the cardinalities of all
two-way joins. One major issue is that for a query like

select * from r1, r2 where r1.x = r2.y and r1.a = ? and r2.b = ?;

it depends on the specific values of r1.a and r2.b whether there is
any (anti-)correlation. And obviously one cannot store correction
factors for each value of a and b.

I see a parallel with indexing and partial indexes here.

We obviously cannot afford to keep cross-table correlations for every
possible pairing of join conditions across every possible set of joined
tables. Much like we can't afford to keep indexes for every possible set of
columns, but even worse.

Much as we let users CREATE INDEX to tell us what cols to index, maybe we
should let them CREATE a cross-table join statistics collector for a
particular set of tables, optionally qualified with a filter condition just
like we do on partial indexes, and optionally transformed via an immutable
expression like we do for expression indexes, e.g.:

CREATE JOIN STATISTICS ON t1 JOIN t2 ON (t1.col1 = t2.col2);

CREATE JOIN STATISTICS ON t1 JOIN t2 ON (lower(t1.col1) = lower(t2.col2))
WHERE t1.othercol IS NOT NULL;

CREATE JOIN STATISTICS ON t1 JOIN t2 ON (t1.colx = t2.colx AND t1.coly = t2.coly);

plus a simplified form like

CREATE JOIN STATISTICS ON t1 JOIN t2 USING (somecol);

That way we let an admin who's tuning queries direct effort at problem
areas. It's not automagical, but it's an area where tools could analyze
pg_stat_statements to direct effort, much like is currently done for index
creation. Like index creation I don't think it's practical to do this
entirely automatically and behind the scenes since collecting the stats for
all possibilities rapidly gets prohibitive.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#10Viktor Leis
leis@in.tum.de
In reply to: Craig Ringer (#8)
Re: Experimental evaluation of PostgreSQL's query optimizer

On 12/22/2015 02:40 AM, Craig Ringer wrote:

On 21 December 2015 at 23:57, Viktor Leis <leis@in.tum.de> wrote:

Please have a look at Figure 6 (page 6) in
http://www.vldb.org/pvldb/vol9/p204-leis.pdf Disabling nested loop
joins without index scan (going from (a) to (b)) results in great
improvements across the board. And even more importantly, it avoids
most of the cases where queries took unreasonably long and timed
out. Basically this amounts to being able to run the query on
PostgreSQL at all.

For that data, yes. But you're ignoring other important cases. Small or even 1-element lookup tables can be one where a nestloop over a seqscan turns out to be by far the fastest way to do the job.
This can really add up if it's deep in a subplan that's executed repeatedly, or if it's part of queries that get run very frequently on a busy OLTP system.

Ok here's what I presume to be the extreme case: Joining a large table
with a 1-entry table.

create table r1 (a int not null);
create table r2 (b int not null);
insert into r1 select 1 from generate_series(1,1000000);
insert into r2 values (1);
analyze r1;
analyze r2;

set enable_mergejoin to off;
set enable_nestloop to on;
set enable_hashjoin to off;
explain select count(*) from r1, r2 where r1.a = r2.b;
\timing
select count(*) from r1, r2 where r1.a = r2.b;
\timing

set enable_nestloop to off;
set enable_hashjoin to on;
explain select count(*) from r1, r2 where r1.a = r2.b;
\timing
select count(*) from r1, r2 where r1.a = r2.b;
\timing

I get 128.894ms vs. 183.724ms, i.e., a 43% slowdown for the hash
join. However, let me stress that this is really the extreme case:

- If the join has few matches (due to inserting a value different from
1 into r2), hash and nested loop join have pretty much the same
performance.

- If you add just one more row to r2, the hash join is faster by a
similar margin (see the snippet below).

- Also if there is disk IO or network involved, I suspect that you
will see no performance difference.
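
The second point can be checked with a one-line variation of the script
above (the added value is arbitrary; it just has to make r2 two rows):

insert into r2 values (2);
analyze r2;
-- re-run the two timed count(*) queries above and compare again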

There are many difficult tradeoffs in any query optimizer, but I do
not think picking nested loops where a hash join can be used is one of
those. To me this seems more like a self-inflicted wound.

--
Viktor Leis


#11Pavel Stehule
pavel.stehule@gmail.com
In reply to: Viktor Leis (#10)
Re: Experimental evaluation of PostgreSQL's query optimizer

2015-12-22 9:28 GMT+01:00 Viktor Leis <leis@in.tum.de>:

On 12/22/2015 02:40 AM, Craig Ringer wrote:

On 21 December 2015 at 23:57, Viktor Leis <leis@in.tum.de> wrote:

Please have a look at Figure 6 (page 6) in
http://www.vldb.org/pvldb/vol9/p204-leis.pdf Disabling nested loop
joins without index scan (going from (a) to (b)) results in great
improvements across the board. And even more importantly, it avoids
most of the cases where queries took unreasonably long and timed
out. Basically this amounts to being able to run the query on
PostgreSQL at all.

For that data, yes. But you're ignoring other important cases. Small or
even 1-element lookup tables can be one where a nestloop over a seqscan
turns out to be by far the fastest way to do the job. This can really add
up if it's deep in a subplan that's executed repeatedly, or if it's part of
queries that get run very frequently on a busy OLTP system.

Ok here's what I presume to be the extreme case: Joining a large table
with a 1-entry table.

create table r1 (a int not null);
create table r2 (b int not null);
insert into r1 select 1 from generate_series(1,1000000);
insert into r2 values (1);
analyze r1;
analyze r2;

set enable_mergejoin to off;
set enable_nestloop to on;
set enable_hashjoin to off;
explain select count(*) from r1, r2 where r1.a = r2.b;
\timing
select count(*) from r1, r2 where r1.a = r2.b;
\timing

set enable_nestloop to off;
set enable_hashjoin to on;
explain select count(*) from r1, r2 where r1.a = r2.b;
\timing
select count(*) from r1, r2 where r1.a = r2.b;
\timing

I get 128.894ms vs. 183.724ms, i.e., a 43% slowdown for the hash
join. However, let me stress that this is really the extreme case:

- If the join has few matches (due to inserting a value different from
1 into r2), hash and nested loop join have pretty much the same
performance.

- If you add just one more row to r2, the hash join is faster by a
similar margin.

- Also if there is disk IO or network involved, I suspect that you
will see no performance differences.

There are many difficult tradeoffs in any query optimizer, but I do
not think picking nested loops where a hash join can be used is one of
those. To me this seems more like a self-inflicted wound.

this is an oversimplification :( Probably correct in OLAP, but wrong in OLTP.
The seq scan forced by a hash join can be problematic.

Regards

Pavel



#12Robert Haas
robertmhaas@gmail.com
In reply to: Viktor Leis (#10)
Re: Experimental evaluation of PostgreSQL's query optimizer

On Tue, Dec 22, 2015 at 3:28 AM, Viktor Leis <leis@in.tum.de> wrote:

I get 128.894ms vs. 183.724ms, i.e., a 43% slowdown for the hash
join. However, let me stress that this is really the extreme case:

- If the join has few matches (due to inserting a value different from
1 into r2), hash and nested loop join have pretty much the same
performance.

- If you add just one more row to r2, the hash join is faster by a
similar margin.

- Also if there is disk IO or network involved, I suspect that you
will see no performance differences.

There are many difficult tradeoffs in any query optimizer, but I do
not think picking nested loops where a hash join can be used is one of
those. To me this seems more like a self-inflicted wound.

Well, that's a fair perspective. I don't think there's any debate
about the fact that we sometimes pick nested loops when we shouldn't,
or that this can be very painful. Figuring out exactly what to
do about that at the code level is harder.

From my point of view, one interesting fact about database
optimization is that the numbers 0 and 1 are phenomenally important
special cases. It is often the case that a join will return at most 1
row per outer row, or that an aggregate will generate exactly 1 group,
or whatever. And the code is littered with special cases - including
Nested Loop - that cater to making such cases fast. Those cases arise
frequently because people engineer their data so that they occur
frequently.

If we could bias the planner against picking nested loops in cases
where they will figure to win only a little but might conceivably lose
a lot, that would probably be a good idea. But it's not obvious
exactly how to figure that out.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#12)
Re: Experimental evaluation of PostgreSQL's query optimizer

Robert Haas <robertmhaas@gmail.com> writes:

From my point of view, one interesting fact about database
optimization is that the numbers 0 and 1 are phenomenally important
special cases.

Yeah.

It is often the case that a join will return at most 1
row per outer row, or that an aggregate will generate exactly 1 group,
or whatever. And the code is littered with special cases - including
Nested Loop - that cater to making such cases fast. Those cases arise
frequently because people engineer their data so that they occur
frequently.

If we could bias the planner against picking nested loops in cases
where they will figure to win only a little but might conceivably lose
a lot, that would probably be a good idea. But it's not obvious
exactly how to figure that out.

There was discussion a while ago of trying to teach the planner to generate
rowcount estimates of 0 or 1 row only in cases where that was provably the
case, eg because the query selects on a unique key. In any other
situation, rowcount estimates would be clamped to a minimum of 2 rows.
This should by itself eliminate the worst abuses of nestloop plans, since
the planner would always assume that the outer scan contains at least 2
rows unless it's known not to. Still, there might be a lot of other less
pleasant side effects; it's hard to tell in advance of actually doing the
work.

regards, tom lane


#14Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Craig Ringer (#9)
Re: Experimental evaluation of PostgreSQL's query optimizer

On 12/21/15 7:49 PM, Craig Ringer wrote:

CREATE JOIN STATISTICS ON t1 JOIN t2 USING (somecol);

That way we let an admin who's tuning queries direct effort at problem
areas. It's not automagical, but it's an area where tools could analyze
pg_stat_statements to direct effort, much like is currently done for
index creation. Like index creation I don't think it's practical to do
this entirely automatically and behind the scenes since collecting the
stats for all possibilities rapidly gets prohibitive.

There's an extremely common case that could (and IMO should) be
automated though, which is computing these statistics for all foreign
keys. We can have a way to disable that for specific keys if necessary,
but I'd bet it's extremely rare to have a FK that you never join on.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


#15Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#5)
Re: Experimental evaluation of PostgreSQL's query optimizer

On 12/21/15 8:36 AM, Tom Lane wrote:

but it might be nice to have some number as to how

reliable a certain estimate is, which is high if the estimate is, say,
derived from a single filter on a base table and sinks as more conditions
are involved or numbers pulled out of thin air.

That might indeed be a useful thing to try to do, though I confess I'm
not quite sure what we'd do with such numbers if we had them. It seems
like the next thing would be to replace single cost estimates for plans
with cost ranges, but then how do you compare a couple of overlapping
ranges?

I suspect that doesn't happen often enough to be a big loss if we just
compare the midpoints in those cases, at least for a first pass. Beyond
that, I think it'd need to be handled on a per-node basis. Nodes that
depend heavily on having low row counts (like nested loop) would want to
heavily penalize the high side of the range. Nodes that aren't as
sensitive to that might not do that. There could possibly be some nodes
that actually penalize the low side of the range.

BTW, I suspect rather than a simple range we might want to pass the
exactly computed estimate as well as the estimated upper and lower error
margin for the estimate. An estimate of 827 +0 -400 could have a very
different meaning than an estimate of [427,827].
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


#16Craig Ringer
craig@2ndquadrant.com
In reply to: Jim Nasby (#14)
Re: Experimental evaluation of PostgreSQL's query optimizer

On 24 December 2015 at 03:43, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

On 12/21/15 7:49 PM, Craig Ringer wrote:

CREATE JOIN STATISTICS ON t1 JOIN t2 USING (somecol);

That way we let an admin who's tuning queries direct effort at problem
areas. It's not automagical, but it's an area where tools could analyze
pg_stat_statements to direct effort, much like is currently done for
index creation. Like index creation I don't think it's practical to do
this entirely automatically and behind the scenes since collecting the
stats for all possibilities rapidly gets prohibitive.

There's an extremely common case that could (and IMO should) be automated
though, which is computing these statistics for all foreign keys. We can
have a way to disable that for specific keys if necessary, but I'd bet it's
extremely rare to have a FK that you never join on.

Good point.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services