About method of PostgreSQL's Optimizer

Started by Pryscila B Guttoski over 20 years ago · 16 messages
#1 Pryscila B Guttoski
pryscila.lista@gmail.com

Hello all!

For my master's course, I'm studying PostgreSQL's optimizer.
I don't know whether anyone on this list has participated in the development
of PostgreSQL's optimizer, but maybe someone can help me with this question.
PostgreSQL generates all possible plans for executing the query (using an
almost exhaustive search), then assigns a cost to each plan, and finally the
cheapest one is selected for execution.
There are other methods for query optimization; one of them is based on plan
transformations (for example, using the A* algorithm) instead of the plan
construction used by PostgreSQL.
Does anyone know why this method was chosen? Are there any papers or
research about it?

Thanks a lot,
Pryscila.

#2 Neil Conway
neilc@samurai.com
In reply to: Pryscila B Guttoski (#1)
Re: About method of PostgreSQL's Optimizer

Pryscila B Guttoski wrote:

For my master's course, I'm studying PostgreSQL's optimizer.
I don't know whether anyone on this list has participated in the development
of PostgreSQL's optimizer, but maybe someone can help me with this question.

pgsql-hackers might be more appropriate.

PostgreSQL generates all possible plans for executing the query (using an
almost exhaustive search), then assigns a cost to each plan, and finally the
cheapest one is selected for execution.
There are other methods for query optimization; one of them is based on plan
transformations (for example, using the A* algorithm) instead of the plan
construction used by PostgreSQL.

Right, the main query planner uses a nearly-exhaustive search. For
queries with many joins (when the cost of an exhaustive search would be
prohibitive), "GEQO" is used, which uses a genetic algorithm to avoid an
exhaustive search of the solution space.

Does anyone know why this method was chosen?

As far as I know, the main planner algorithm is fairly standard and is
mainly different from System R's canonical algorithm in the details,
like whether non-left-deep plans are pruned.
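
To make the System R-style search concrete, here is a minimal bottom-up
dynamic-programming sketch in Python; the relation names, sizes, and cost
formula are invented for illustration and stand in for the planner's much
richer cost model:

    # Minimal System R-style join search: build the best plan for every
    # subset of relations bottom-up, reusing the best plans for smaller
    # subsets. The cost formula is a toy stand-in for the real model.
    from itertools import combinations

    def best_join_order(rel_sizes, selectivity=0.01):
        rels = list(rel_sizes)
        best = {}  # frozenset of relations -> (cost, rows, plan)
        for rel in rels:  # level 1: base relations (sequential scans)
            best[frozenset([rel])] = (rel_sizes[rel], rel_sizes[rel], rel)
        for level in range(2, len(rels) + 1):
            for subset in combinations(rels, level):
                s = frozenset(subset)
                candidates = []
                # Split the subset into every pair of non-empty
                # sub-plans whose best costs are already known.
                for k in range(1, level):
                    for left in combinations(subset, k):
                        l, r = frozenset(left), s - frozenset(left)
                        lcost, lrows, lplan = best[l]
                        rcost, rrows, rplan = best[r]
                        rows = lrows * rrows * selectivity
                        cost = lcost + rcost + rows  # toy join cost
                        candidates.append((cost, rows, (lplan, rplan)))
                best[s] = min(candidates, key=lambda c: c[0])
        return best[frozenset(rels)]

    print(best_join_order({"a": 1000, "b": 100, "c": 10}))

GEQO replaces this exhaustive enumeration with a randomized search over the
same cost estimates once the number of relations makes the subset table too
large.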

Are there any papers or research about it?

There are many papers on the System R algorithm and similar techniques,
which should explain the basic motivations for the design. I'm not aware
of any papers specifically on the PostgreSQL query optimizer, although
there have been a few presentations on it:

http://neilc.treehou.se/optimizer.pdf
http://conferences.oreillynet.com/presentations/os2003/lane_tom.pdf

-Neil

#3 Joshua D. Drake
jd@commandprompt.com
In reply to: Pryscila B Guttoski (#1)
Re: About method of PostgreSQL's Optimizer

Pryscila B Guttoski wrote:

Hello all!

For my master's course, I'm studying PostgreSQL's optimizer.
I don't know whether anyone on this list has participated in the development
of PostgreSQL's optimizer, but maybe someone can help me with this question.
PostgreSQL generates all possible plans for executing the query (using an
almost exhaustive search), then assigns a cost to each plan, and finally the
cheapest one is selected for execution.
There are other methods for query optimization; one of them is based on plan
transformations (for example, using the A* algorithm) instead of the plan
construction used by PostgreSQL.
Does anyone know why this method was chosen? Are there any papers or
research about it?

You may want to pass this question over to pgsql-hackers.

Sincerely,

Joshua D. Drake

--
Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240
PostgreSQL Replication, Consulting, Custom Programming, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/

#4 Cristian Prieto
cristian@clickdiario.com
In reply to: Pryscila B Guttoski (#1)
Re: About method of PostgreSQL's Optimizer

I know you have probably already read this, but I think it is a good paper to start with...

http://lca2005.linux.org.au/Papers/Neil%20Conway/Inside%20the%20PostgreSQL%20Query%20Optimizer/pg_query_optimizer.pdf

Anyway, do you know where I could get more information and theory about database plan optimizers in general? I like that topic. Thanks a lot!

#5 Pryscila B Guttoski
pryscila.lista@gmail.com
In reply to: Neil Conway (#2)
Re: About method of PostgreSQL's Optimizer

Thanks, guys!
I'll send it to pgsql-hackers...

[]'s
Pryscila

#6 Pryscila B Guttoski
pryscila.lista@gmail.com
In reply to: Pryscila B Guttoski (#1)

Hello all!

For my master's course, I'm studying PostgreSQL's optimizer.
I don't know whether anyone on this list has participated in the development
of PostgreSQL's optimizer, but maybe someone can help me with this question.
PostgreSQL generates all possible plans for executing the query (using an
almost exhaustive search), then assigns a cost to each plan, and finally the
cheapest one is selected for execution.
There are other methods for query optimization; one of them is based on plan
transformations (for example, using the A* algorithm) instead of the plan
construction used by PostgreSQL.
Does anyone know why this method was chosen? Are there any papers or
research about it?

Thanks a lot,
Pryscila.

#7 Neil Conway
neilc@samurai.com
In reply to: Cristian Prieto (#4)
Re: About method of PostgreSQL's Optimizer

Cristian Prieto wrote:

Anyway, do you know where I could get more information and theory about
database plan optimizers in general?

Personally I like this survey paper on query optimization:

http://citeseer.csail.mit.edu/371707.html

The paper also cites a lot of other papers that cover specific
techniques in more detail.

-Neil

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Neil Conway (#2)
Re: About method of PostgreSQL's Optimizer

Neil Conway <neilc@samurai.com> writes:

Pryscila B Guttoski wrote:

For my master's course, I'm studying PostgreSQL's optimizer.
I don't know whether anyone on this list has participated in the development
of PostgreSQL's optimizer, but maybe someone can help me with this question.

pgsql-hackers might be more appropriate.

AFAIK the basic code goes back to Berkeley days. Elein might possibly
remember something about it, but no one else that's on the project now
was involved then. The right place to look would be in the Berkeley
project's publications:

http://db.cs.berkeley.edu//papers/

I agree with Neil's point that it's a spiritual descendant of System R
and there's plenty of material about that in the general database
literature.

regards, tom lane

#9 Jonah H. Harris
jonah.harris@gmail.com
In reply to: Pryscila B Guttoski (#6)
Re: About method of PostgreSQL's Optimizer

Pryscila,

While I haven't been too involved in the open-source PostgreSQL optimizer, I
have done some work on it and on optimizers in other database systems.

Based on my work, it is my opinion that PostgreSQL, as well as other
databases that use a cost-based optimizer, prefers a breadth-first algorithm
because one cannot determine the "real" cost of each node at run time
without systematically examining all possibilities through calculation. This
is the opposite of a rule-based optimizer, which defines heuristics that can
be evaluated by a best-first algorithm such as A*.

In a cost-based optimizer, the system must calculate the "cost" of each path
based on data that changes at run time, including indexing, cardinality,
tuple size, available memory, CPU usage, disk access times, etc. To a
cost-based optimizer, every query is unique and therefore cannot follow a
pre-weighted path in the same fashion. I can certainly see A* being used in a
rule-based optimizer, but not in a real-time cost-based optimizer.

Perhaps Tom, Bruce, et al have more specifics on PostgreSQL's
implementation.

-Jonah

--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/

#10 Pryscila B Guttoski
pryscila.lista@gmail.com
In reply to: Tom Lane (#8)
Re: About method of PostgreSQL's Optimizer

Hi guys,

I really appreciate your suggestions about papers, especially this one:
http://citeseer.csail.mit.edu/371707.html

I found some answers in it, such as this one:

Q: Why does the main query planner use a nearly exhaustive search?
A: (Page 20, section 4.2.2) "... up to about ten joins, dynamic programming is
preferred over the randomized algorithms because it is faster and it
guarantees finding the optimal plan. For larger queries, the situation is
reversed, and despite the probabilistic nature of the randomized algorithms,
their efficiency makes them the algorithms of choice."

Also in this paper, there is something about the A* algorithm that is very
interesting for my research.

I have one more question. Sorry for asking it on this list, but this is the
only place where I've gotten answers...
Has anybody heard anything about using PDDL ("Planning Domain Definition
Language") for query optimization?

[]'s,
Pryscila

#11 Josh Berkus
josh@agliodbs.com
In reply to: Jonah H. Harris (#9)
Re: About method of PostgreSQL's Optimizer

Pryscila,

There are other methods for query optimization; one of them is based on
plan transformations (for example, using the A* algorithm) instead of the
plan construction used by PostgreSQL.

We do certainly need a specific optimization for large star-schema joins. I'm
not certain that A* is suitable for our cost model, though; I think we might
need to work up something more particular to Postgres.

Does anyone know why this method was chosen? Are there any papers or
research about it?

There are probably papers on ACM, but I've not read them. Ours is a pretty
straightforward implementation of a cost-based optimizer. You can always
read the code ;-)

Mark Kirkwood put together this nice paper on planner statistics:
http://www.powerpostgresql.com/PlanStats

--
Josh Berkus
Aglio Database Solutions
San Francisco

#12 Pryscila B Guttoski
pryscila.lista@gmail.com
In reply to: Jonah H. Harris (#9)
Re: About method of PostgreSQL's Optimizer

Hi Jonah,

Thanks for your email; I really appreciate your opinions.

Is it interesting to use both techniques? For example, given a query, an
optimizer:
1. Generates one of the possible execution plans.
2. Applies transformations to the original plan, based on rules and
heuristics, resulting in new alternative plans.
3. Evaluates the cost of the generated plans using statistics.
4. Keeps the plans that have a lower cost than the original plan.
5. Repeats steps 2-4 over the new alternative plans.
What do you think about it? Are there any restrictions that I haven't seen?
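
A minimal sketch of the loop in steps 1-5 (Python; the plan representation,
the single rewrite rule, and the cost formula are invented placeholders, not
PostgreSQL's):

    # Toy version of the propose-transform-cost-keep loop above. Plans
    # are join orders (tuples of relation names); the only "rule" is
    # swapping two adjacent relations; the cost formula is a stand-in.

    def toy_cost(plan, sizes, sel=0.01):
        rows = cost = sizes[plan[0]]
        for rel in plan[1:]:
            rows = rows * sizes[rel] * sel
            cost += rows
        return cost

    def transform(plan):
        # Step 2: generate neighbours via one heuristic rewrite rule.
        for i in range(len(plan) - 1):
            p = list(plan)
            p[i], p[i + 1] = p[i + 1], p[i]
            yield tuple(p)

    def optimize(sizes, rounds=10):
        plan = tuple(sizes)                   # step 1: an initial plan
        cost = toy_cost(plan, sizes)
        for _ in range(rounds):               # step 5: repeat
            frontier = [(toy_cost(p, sizes), p) for p in transform(plan)]
            best_cost, best = min(frontier)   # step 3: cost candidates
            if best_cost >= cost:             # step 4: keep improvements
                break
            plan, cost = best, best_cost
        return plan, cost

    print(optimize({"orders": 10**6, "lines": 10**7, "cust": 10**4}))

Note that this is essentially greedy local search: it can get stuck in a
local minimum, and step 3 still needs a full statistics-driven cost
computation for every candidate, which is the restriction raised in the
reply below.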

On a different topic: have you heard about using PDDL ("Planning Domain
Definition Language") for query optimization?

[]'s
Pryscila

#13 Jonah H. Harris
jonah.harris@gmail.com
In reply to: Pryscila B Guttoski (#12)
Re: About method of PostgreSQL's Optimizer

Pryscila,

Step 2 is basically where you find the difference between a cost-based
optimizer (CBO) and a rule-based optimizer (RBO). A CBO is based on the
computed execution cost of the query, whereas an RBO uses more generalized
heuristics.

Let's take an example of what you're proposing and see if we can work it out
from there.

Say we have the following (this is a generalized CBO approach, not
PostgreSQL-specific):

Oracle's SCOTT.EMP table with a cardinality of 1 million and indexes on empno
and ename. For storage purposes, say that the empno index takes up 3600
blocks, the ename index takes up 7800 blocks, and the table itself takes up
17000 blocks. We'll also say that we have a 256-megabyte buffer cache, in
which we have cached 50% of the empno index, 10% of the ename index, and 5%
of the emp table data.

A user then issues the following query:

SELECT empno, ename FROM emp;

A cost-based optimizer will see the following options:
1. Satisfy the query with a full table scan (FTS), at the cost of
retrieving all 17000 blocks from disk.
2. Instead of a FTS, retrieve all the requested data from the two indexes
(11400 blocks) and join it (with which join algorithm?).

Without performing a breadth-first algorithm, how can one evaluate both
options in a way that would allow you to perform heuristic transformations
dynamically? What transformation/heuristic/rule could you use? A CBO
implementation has to calculate the amount of I/O needed for each plan based
on several statistics, such as what is *potentially* in the cache, the
access time for block I/O (including prefetching, if the storage manager has
it), and other factors. If you can name a database that uses a best-first
algorithm such as A*, please send me a link to its docs; I'd be
interested in reading about the implementation.
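
Working through these numbers with a deliberately naive cost model (the
cache discount and the join overhead below are invented simplifications, not
any real database's formulas):

    # Pricing the two candidate plans for SELECT empno, ename FROM emp;
    # with a naive model: cost = number of blocks expected from disk.
    # The cache discount and join overhead are invented simplifications.

    def disk_reads(blocks, cached_fraction):
        return blocks * (1.0 - cached_fraction)

    # Plan 1: full table scan of emp (17000 blocks, 5% cached).
    fts_cost = disk_reads(17000, 0.05)                # 16150.0

    # Plan 2: scan both indexes, then join the two result sets.
    empno_cost = disk_reads(3600, 0.50)               # 1800.0
    ename_cost = disk_reads(7800, 0.10)               # 7020.0
    join_cost = 2000                                  # invented overhead
    index_cost = empno_cost + ename_cost + join_cost  # 10820.0

    print(f"FTS: {fts_cost}, index scans + join: {index_cost}")

Both totals move with the cache state and with whichever join algorithm is
priced in, which is exactly why a CBO has to compute them per query instead
of applying a fixed rule.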

As for using both in the same optimizer, I could only see an algorithm such
as a customized A* being used to plan *some* large queries. The reason I
say this is that the cost calculation, which would still need to be
breadth-first, could calculate and cache the cost of most nodes, thereby
allowing you to perform transformations at the tail of the calculation.

As for references to query optimization possibly using best-first
algorithms, I think I saw several types of algorithms used in work from a
university query optimization engine. I can't remember whether it was
Cornell, Stanford, or Wisconsin... I'll try to get you a link to their info.

-Jonah

--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/

#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jonah H. Harris (#13)
Re: About method of PostgreSQL's Optimizer

"Jonah H. Harris" <jonah.harris@gmail.com> writes:

As for using both in the same optimizer, I could only see an algorithm such
as a customized A* being used to plan *some* large queries. The reason I
say this is that the cost calculation, which would still need to be
breadth-first, could calculate and cache the cost of most nodes, thereby
allowing you to perform transformations at the tail of the calculation.

We do already have two different plan search algorithms: the strict
bottom-up dynamic programming approach (System R style) and the GEQO
optimizer, which we switch to when there are too many joins to allow an
exhaustive search. The GEQO code still depends on the normal plan cost
estimation code, but it doesn't consider every possible plan.

I've never been very happy with the GEQO code: the random component of
the algorithm means you get unpredictable (and sometimes awful) plans,
and the particular variant that we are using is really designed to solve
traveling-salesman problems. It's at best a poor fit to the join
planning problem.
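
For flavour, a toy genetic search over join orders might look like the
sketch below. This is not PostgreSQL's actual GEQO code: the cost formula is
an invented stand-in for the planner's estimates, and real GEQO uses
TSP-style edge recombination rather than the plain swap mutation used here.

    # Toy genetic search over join orders, in the spirit of (but much
    # simpler than) GEQO.
    import random

    def toy_cost(order, sizes, sel=0.01):
        rows = cost = sizes[order[0]]
        for rel in order[1:]:
            rows = rows * sizes[rel] * sel
            cost += rows
        return cost

    def mutate(order):
        a, b = random.sample(range(len(order)), 2)  # swap two positions
        order = list(order)
        order[a], order[b] = order[b], order[a]
        return tuple(order)

    def geqo_like(sizes, pop_size=20, generations=50):
        rels = list(sizes)
        pop = [tuple(random.sample(rels, len(rels)))
               for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=lambda o: toy_cost(o, sizes))  # fitness
            survivors = pop[: pop_size // 2]            # selection
            pop = survivors + [mutate(random.choice(survivors))
                               for _ in range(pop_size - len(survivors))]
        return min(pop, key=lambda o: toy_cost(o, sizes))

    sizes = {f"t{i}": random.randint(10, 10**6) for i in range(12)}
    print(geqo_like(sizes))

The random seeding and mutation are the source of the unpredictability
described above: two runs over the same query can settle on different join
orders.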

So it seems interesting to me to think about replacing GEQO with a
rule-based optimizer for large join search spaces.

There are previous discussions about this in the archives, I believe.

regards, tom lane

#15 Jonah H. Harris
jonah.harris@gmail.com
In reply to: Tom Lane (#14)
Re: About method of PostgreSQL's Optimizer

Tom,

I agree. There have been several occasions where GEQO has performed poorly
for me. I'll search the archives for the past discussions.

Sorry for sending this to you twice, Tom... forgot to hit reply-all :(

--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/

#16 Jonah H. Harris
jonah.harris@gmail.com
In reply to: Jonah H. Harris (#15)
Re: About method of PostgreSQL's Optimizer

Pryscila,

For research reference, you may want to look at the work done on the
Columbia Query Optimization Framework. As I recall, it (or its
predecessors) had both cost-based and rule-based optimization. If you need
the code for it, I can dig it up on one of my old systems.

Albeit dated, another good reference for optimizer implementation is the
Cascades query optimization framework.

--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/