Can the query planner create indexes?

Started by Dario Beraldiover 15 years ago14 messagesgeneral
Jump to latest
#1Dario Beraldi
dario.beraldi@ed.ac.uk

Hello,

This question is just for my curiosity...

When an index is available for a query, the planner decides whether to
use it or not depending on whether it would make the query perform
better, right? However if an index, which does not exist, would make
the query run better the planner is not able (allowed?) to create such
index, use it, and drop it once the query is done. Why is it so?
Why is the query planner not allowed to create indexes, but only
allowed to use or not use what's available?

Am I misunderstanding the way the planner works? Any explanation &
clarification much appreciated!

All the best and Christmas wishes!
Dario

--

Dr. Dario Beraldi
Institute of Evolutionary Biology
University of Edinburgh
West Mains Road
Edinburgh EH9 3JT
Scotland, UK

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

#2Gabriele Bartolini
gabriele.bartolini@2ndQuadrant.it
In reply to: Dario Beraldi (#1)
Re: Can the query planner create indexes?

Ciao Dario,

On Tue, 21 Dec 2010 09:14:36 +0000, Dario Beraldi
<dario.beraldi@ed.ac.uk> wrote:

the query run better the planner is not able (allowed?) to create
such
index, use it, and drop it once the query is done. Why is it so?

Because it is not its responsibility. This is the simplest and most
rational answer.

I would reply with some questions that hopefully will give you the
answer. How can you define a 'good' query from a 'bad' query? Consider
the case when an user launches a wrong query and the planner you propose
starts creating an index. I believe that would be a mess.

Then ... how could you project this scenario in a concurrent context
where multiple users launch queries that 'need' an index?

I suggest that you look at the documentation for more information.
Otherwise, I strongly suggest that you read the chapter on the planner
from Greg's book on High Performance (which you can find from here:
http://www.postgresql.org/docs/books/)

Merry Christmas to you too!

Cheers,
Gabriele

--
Gabriele Bartolini - 2ndQuadrant Italia
PostgreSQL Training, Services and Support
Gabriele.Bartolini@2ndQuadrant.it - www.2ndQuadrant.it

In reply to: Dario Beraldi (#1)
Re: Can the query planner create indexes?

Hello Dario,

When an index is available for a query, the planner decides whether to use

it or not depending on whether it would make the query perform better,
right? However if an index, which does not exist, would make the query run
better the planner is not able (allowed?) to create such index, use it, and
drop it once the query is done. Why is it so?

From my knowledge there is more then one answer:

a) There is a proposal (and, at the time being) also some code on pgfoundry
creating "hypothetical indexes"
http://postgresql.1045698.n5.nabble.com/Hypothetical-Indexes-PostgreSQL-extension-PGCON-2010-td3288085.html
http://archives.postgresql.org/pgsql-hackers/2010-12/msg00087.php

The idea is: to play with "what-would-be-if-there-would-be-an-index".

With keywords "hypothetical index", "Index advisor" and "virtual index"
there is some research googleable

b) creating an index requires to read the data-to-be-indexed. So, to have an
index pointing at the interesting rows for your query, the table has to be
read ... which would be the perfect time to allready select the interesting
rows. And after having the interesting rows: the index is worthless

c) PostgreSQL is in fact doing something quite similiar like "temporary
indexes during the query"; only for the selected rows. The words to google
for is "hash maps", "bitmap access".

Why is the query planner not allowed to create indexes, but only allowed to

use or not use what's available?

as in b): Creating an index is quite expensiv

additionally: having multiple possible plans is also creating a new decision
problem: which of the possible plans will lead to the better result; again
with some meanings of "better": faster result or less processor usage or
less memory usage or less disk accesses. So adding additional indices during
planning would worsen this problem; which has to be balanced against
possible gains.

So: hypothetical indizes are a good idea, BUT current limitations would most
likely force them to be made outside the life query process.

Best wishes,

Harald

--
GHUM GmbH
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607

Amtsgericht Stuttgart, HRB 734971
-
persuadere.
et programmare

#4Gabriele Bartolini
gabriele.bartolini@2ndQuadrant.it
In reply to: Massa, Harald Armin (#3)
Re: Can the query planner create indexes?

Hi Harald,

On Tue, 21 Dec 2010 11:42:40 +0100, "Massa, Harald Armin"
<chef@ghum.de> wrote:

a) There is a proposal (and, at the time being) also some code on
pgfoundry creating "hypothetical indexes"

http://postgresql.1045698.n5.nabble.com/Hypothetical-Indexes-PostgreSQL-extension-PGCON-2010-td3288085.html
[1]
http://archives.postgresql.org/pgsql-hackers/2010-12/msg00087.php [2]

I totally missed this. Thanks for posting it.

Cheers,
Gabriele

--
Gabriele Bartolini - 2ndQuadrant Italia
PostgreSQL Training, Services and Support
Gabriele.Bartolini@2ndQuadrant.it - www.2ndQuadrant.it

#5Dario Beraldi
dario.beraldi@ed.ac.uk
In reply to: Gabriele Bartolini (#2)
Re: Can the query planner create indexes?

Ok, thanks a lot to all of you for your answers! (Always impressed by
the prompt feedback you get on this list!)

Quoting Gabriele Bartolini <Gabriele.Bartolini@2ndQuadrant.it>:

Ciao Dario,

On Tue, 21 Dec 2010 09:14:36 +0000, Dario Beraldi
<dario.beraldi@ed.ac.uk> wrote:

the query run better the planner is not able (allowed?) to create such
index, use it, and drop it once the query is done. Why is it so?

Because it is not its responsibility. This is the simplest and most
rational answer.

I would reply with some questions that hopefully will give you the
answer. How can you define a 'good' query from a 'bad' query?
Consider the case when an user launches a wrong query and the
planner you propose starts creating an index. I believe that would
be a mess.

Then ... how could you project this scenario in a concurrent context
where multiple users launch queries that 'need' an index?

I suggest that you look at the documentation for more information.
Otherwise, I strongly suggest that you read the chapter on the
planner from Greg's book on High Performance (which you can find
from here: http://www.postgresql.org/docs/books/)

Merry Christmas to you too!

Cheers,
Gabriele

--
Gabriele Bartolini - 2ndQuadrant Italia
PostgreSQL Training, Services and Support
Gabriele.Bartolini@2ndQuadrant.it - www.2ndQuadrant.it

--

Dr. Dario Beraldi
Institute of Evolutionary Biology
University of Edinburgh
West Mains Road
Edinburgh EH9 3JT
Scotland, UK

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

#6Jeremy Harris
jgh@wizmail.org
In reply to: Massa, Harald Armin (#3)
Re: Can the query planner create indexes?

On 2010-12-21 10:42, Massa, Harald Armin wrote:

b) creating an index requires to read the data-to-be-indexed. So, to have an
index pointing at the interesting rows for your query, the table has to be
read ... which would be the perfect time to allready select the interesting
rows. And after having the interesting rows: the index is worthless

... until another similar query comes along, when suddenly it's a massive win.
Why not auto-create indices for some limited period after database load
(copy? any large number of inserts from a single connection?), track those
that actually get re-used and remove the rest? Would this not provide
a better out-of-the-box experience for neophytes?

[...]

Why is the query planner not allowed to create indexes, but only allowed to

use or not use what's available?

as in b): Creating an index is quite expensiv

How much more so than doing that full-table-scan plus sort, which your
query is doing anyway?

Cheers,
Jeremy

#7Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeremy Harris (#6)
Re: Can the query planner create indexes?

On 2010-12-21 10:42, Massa, Harald Armin wrote:

b) creating an index requires to read the data-to-be-indexed. So, to
have an
index pointing at the interesting rows for your query, the table has to
be
read ... which would be the perfect time to allready select the
interesting
rows. And after having the interesting rows: the index is worthless

... until another similar query comes along, when suddenly it's a massive
win.
Why not auto-create indices for some limited period after database load
(copy? any large number of inserts from a single connection?), track
those
that actually get re-used and remove the rest? Would this not provide
a better out-of-the-box experience for neophytes?

Say you have a table with several columns (A,B,C), and the query is using
some of them. What indexes would you create? One index on every column? A
multi-column index on all columns? Indexs for each combination of columns?

There really is no automatic way to solve this puzzle using a single
query. Indexing strategy is a very tough design discipline, and it
requires a complex knowledge of the workload. One slow query does not mean
the index should be created - what if that was just an ad-hoc query and
will not be executed ever again?

So although this 'automatic index creation' seems nice, it really does not
work in practice.

I really don't want to work with products that try to be smarter than me
(and I admit I'm pretty dumb from time to time) and act rather randomly
due to this 'artificial intelligence'. I've already been there and I don't
want to repeat this experience.

[...]

Why is the query planner not allowed to create indexes, but only allowed
to

use or not use what's available?

as in b): Creating an index is quite expensiv

How much more so than doing that full-table-scan plus sort, which your
query is doing anyway?

A lot. It consumes a lot of CPU, it may consume a lot of memory (up to
maintenance_work_mem) etc. So imagine a few users, building indices on a
big table simultaneously - no, I really don't want to see this on a
production server.

Building an index is just one side of the problem - maintenance of the
indexes is another thing. Each index has an impact on write operations
(INSERT/UPDATE) and may cause that HOT actually does not work (thus
causing unnecessary bloat of the index). And the indexes may need lot of
space on a disk.

But the real show stopper is probably locking. Building an index takes a
write lock on a table, effectively blocking writes. Sure, you can use a
'CREATE INDEX ... CONCURRENTLY' introduced in 8.2, but there are some
disadvantages of that (see
http://www.postgresql.org/docs/9.0/static/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY).

cheers
Tomas

#8Grzegorz Jaśkiewicz
gryzman@gmail.com
In reply to: Tomas Vondra (#7)
Re: Can the query planner create indexes?

I don't think planner should do things like creating an index. But it
might hint at doing it in the logs.
There was a discussion around that sort of feature on -hackers not so
long time ago. I don't remember what the conclusion was, but probably
that it just isn't worth wasting planner's cycles when you execute a
query.

#9Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Grzegorz Jaśkiewicz (#8)
Re: Can the query planner create indexes?

I don't think planner should do things like creating an index. But it
might hint at doing it in the logs.
There was a discussion around that sort of feature on -hackers not so
long time ago. I don't remember what the conclusion was, but probably
that it just isn't worth wasting planner's cycles when you execute a
query.

Yes, that would be a much better solution, definitely. Something like
'watch seq-scan nodes and if the number of matching rows is much smaller
than the total number, put a HINT into log.'

I was thinking about doing something like this for correlated columns
(when the number of matching rows is severely underestimated). But that's
far in the future.

Tomas

#10Jeremy Harris
jgh@wizmail.org
In reply to: Tomas Vondra (#7)
Re: Can the query planner create indexes?

On 2010-12-21 14:26, tv@fuzzy.cz wrote:

Why not auto-create indices for some limited period after database load
(copy? any large number of inserts from a single connection?), track
those
that actually get re-used and remove the rest? Would this not provide
a better out-of-the-box experience for neophytes?

Say you have a table with several columns (A,B,C), and the query is using
some of them. What indexes would you create? One index on every column? A
multi-column index on all columns? Indexs for each combination of columns?

One single-column index, on the first index-worthy column appearing.
Keep it simple. Maybe, while you're doing that full-table-scan. gather
stats on all the indexable columns for later reference, to guide choice of
which column to index later.

There really is no automatic way to solve this puzzle using a single
query. Indexing strategy is a very tough design discipline, and it
requires a complex knowledge of the workload. One slow query does not mean
the index should be created - what if that was just an ad-hoc query and
will not be executed ever again?

Then the index you just built gets automatically dropped, as I said above.

I really don't want to work with products that try to be smarter than me
(and I admit I'm pretty dumb from time to time) and act rather randomly
due to this 'artificial intelligence'. I've already been there and I don't
want to repeat this experience.

Then, since you're not a neophyte, leave the feature turned off. But don't
deny the possibility of using it to actual neophytes.

as in b): Creating an index is quite expensiv

How much more so than doing that full-table-scan plus sort, which your
query is doing anyway?

A lot. It consumes a lot of CPU, it may consume a lot of memory (up to
maintenance_work_mem) etc. So imagine a few users, building indices on a
big table simultaneously

Pffthht. One simple trylock, used only by the auto-indexer.

Building an index is just one side of the problem - maintenance of the
indexes is another thing. Each index has an impact on write operations
(INSERT/UPDATE) and may cause that HOT actually does not work (thus
causing unnecessary bloat of the index).

This is worth discussing. How much help does the DBMS currently give
the DBA in evaluating these tradeoffs? Could we do better, given an
actual installation and workload?

And the indexes may need lot of
space on a disk.

By all means require limits as well as a "don't do that" switch.

But the real show stopper is probably locking. Building an index takes a
write lock on a table, effectively blocking writes. Sure, you can use a
'CREATE INDEX ... CONCURRENTLY' introduced in 8.2, but there are some
disadvantages of that (see
http://www.postgresql.org/docs/9.0/static/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY).

The only disadvantages I see there are a) the leftover "invalid" index - which feels like a
bug; why is it not auto-dropped? and b) the second scan, which implies more total work
and a wish to background that portion after completing the query triggering
the auto-index.

Don't forget I suggested doing this only for a limited time after DB creation. Possibly another
reason to turn it off should be any manual index creation, as that implies that the DBA
knows about indexing.

I see in another thread you suggest merely placing hints in the log. That's a fine
first step - but I'll then be wanting to auto-parse that log to auto-create....

Cheers,
Jeremy

#11bricklen
bricklen@gmail.com
In reply to: Jeremy Harris (#10)
Re: Can the query planner create indexes?

On Tue, Dec 21, 2010 at 7:34 AM, Jeremy Harris <jgh@wizmail.org> wrote:

On 2010-12-21 14:26, tv@fuzzy.cz wrote:

Why not auto-create indices for some limited period after database load
(copy?  any large number of inserts from a single connection?), track
those
that actually get re-used and remove the rest?   Would this not provide
a better out-of-the-box experience for neophytes?

Say you have a table with several columns (A,B,C), and the query is using
some of them. What indexes would you create? One index on every column? A
multi-column index on all columns? Indexs for each combination of columns?

One single-column index, on the first index-worthy column appearing.
Keep it simple.    Maybe, while you're doing that full-table-scan. gather
stats on all the indexable columns for later reference, to guide choice of
which column to index later.

There really is no automatic way to solve this puzzle using a single
query. Indexing strategy is a very tough design discipline, and it
requires a complex knowledge of the workload. One slow query does not mean
the index should be created - what if that was just an ad-hoc query and
will not be executed ever again?

Then the index you just built gets automatically dropped, as I said above.

I really don't want to work with products that try to be smarter than me
(and I admit I'm pretty dumb from time to time) and act rather randomly
due to this 'artificial intelligence'. I've already been there and I don't
want to repeat this experience.

Then, since you're not a neophyte, leave the feature turned off.   But don't
deny the possibility of using it to actual neophytes.

as in b): Creating an index is quite expensiv

How much more so than doing that full-table-scan plus sort, which your
query is doing anyway?

A lot. It consumes a lot of CPU, it may consume a lot of memory (up to
maintenance_work_mem) etc. So imagine a few users, building indices on a
big table simultaneously

Pffthht.   One simple trylock, used only by the auto-indexer.

Building an index is just one side of the problem - maintenance of the
indexes is another thing. Each index has an impact on write operations
(INSERT/UPDATE) and may cause that HOT actually does not work (thus
causing unnecessary bloat of the index).

This is worth discussing.    How much help does the DBMS currently give
the DBA in evaluating these tradeoffs?    Could we do better, given an
actual installation and workload?

And the indexes may need lot of
space on a disk.

By all means require limits as well as a "don't do that" switch.

But the real show stopper is probably locking. Building an index takes a
write lock on a table, effectively blocking writes. Sure, you can use a
'CREATE INDEX ... CONCURRENTLY' introduced in 8.2, but there are some
disadvantages of that (see

http://www.postgresql.org/docs/9.0/static/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY).

The only disadvantages I see there are a) the leftover "invalid" index -
which feels like a
bug; why is it not auto-dropped?   and b) the second scan, which implies
more total work
and a wish to background that portion after completing the query triggering
the auto-index.

Don't forget I suggested doing this only for a limited time after DB
creation.  Possibly another
reason to turn it off should be any manual index creation, as that implies
that the DBA
knows about indexing.

I see in another thread you suggest merely placing hints in the log.
That's a fine
first step - but I'll then be wanting to auto-parse that log to
auto-create....

Cheers,
   Jeremy

This thread offers up some interesting possibilities. Expanding on
what has already been discussed, maybe a contrib module for query
tuning/index suggestions?
Some things that came to mind immediately that the module could do
(feasible or not):
- Look at the EXPLAIN ANALYZE
- Examine the index access methods and table scans + costs/rows
- Which indexes were used?
- What were the blocks/tuples hit & read?
- Look at join conditions and WHERE clause filters
- Data types in the joins (mismatched?)
- Churn rate of the tables, eg. the updates/deletes/inserts. This
might allow suggestion of other index types (eg. gist)

Tool then provides feedback on possibly helpful indexes to test, and
why (hypothetical indexes could be applied here). Possibly provided
suggestions on ways to improve the query, eg. data types don't match
in the join, EXISTS vs IN, etc

#12Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeremy Harris (#10)
Re: Can the query planner create indexes?

Dne 21.12.2010 16:34, Jeremy Harris napsal(a):

There really is no automatic way to solve this puzzle using a single
query. Indexing strategy is a very tough design discipline, and it
requires a complex knowledge of the workload. One slow query does not
mean
the index should be created - what if that was just an ad-hoc query and
will not be executed ever again?

Then the index you just built gets automatically dropped, as I said above.

I'm a bit confused. Should the indexes be dropped automatically (as you
state here) or kept for the future. Because if they should be dropped,
then it does not make sense to do this magic just for a limited time
after the DB goes live.

I really don't want to work with products that try to be smarter than me
(and I admit I'm pretty dumb from time to time) and act rather randomly
due to this 'artificial intelligence'. I've already been there and I
don't
want to repeat this experience.

Then, since you're not a neophyte, leave the feature turned off. But
don't deny the possibility of using it to actual neophytes.

This is not a question of whether I am a neophyte or not, this is a
question of good practices. My experience is that building indexes
should be done when designing the application, and tested at in a test
environment. At production, log slow queries, analyze the log and add
indexes when needed.

This 'neophyte approach' is a really bad idea from my point of view. It
inevitably leads to a database with a zillion of unnecessary indexes,
missing the right ones. And any limits don't prevent this. In the end
you'll get a slow database, and the neophytes would blame the database
although that state is inevitable.

I've been working with several database products over the years, and
AFAIK none of them does this. In most cases "the whole industry is
wrong" is usually a really bad sign (does not hold if you're Steve Jobs).

Most of them do have 'advisors' though - that's a good idea, and I think
sooner or later that will be in PostgreSQL too.

as in b): Creating an index is quite expensiv

How much more so than doing that full-table-scan plus sort, which your
query is doing anyway?

A lot. It consumes a lot of CPU, it may consume a lot of memory (up to
maintenance_work_mem) etc. So imagine a few users, building indices on a
big table simultaneously

Pffthht. One simple trylock, used only by the auto-indexer.

Pffthht? Anyway trylock is not an answer. Think about a different table
for each user.

Building an index is just one side of the problem - maintenance of the
indexes is another thing. Each index has an impact on write operations
(INSERT/UPDATE) and may cause that HOT actually does not work (thus
causing unnecessary bloat of the index).

This is worth discussing. How much help does the DBMS currently give
the DBA in evaluating these tradeoffs? Could we do better, given an
actual installation and workload?

The database gives you explain plans, plus pg_stat_* and pg_statio_*
views. That's a lot of data, although does not provide a complete view
in many cases. And then there's a log_min_duration, which is the primary
weapon in fighting slow queries after going live.

And the indexes may need lot of
space on a disk.

By all means require limits as well as a "don't do that" switch.

As I said, this is not a good approach from my POV. And it's really
really hard to implement this in a way that those who don't want to use
it don't have to pay the price. Because what portion of users would
really use this feature? 0.0001%?

But the real show stopper is probably locking. Building an index takes a
write lock on a table, effectively blocking writes. Sure, you can use a
'CREATE INDEX ... CONCURRENTLY' introduced in 8.2, but there are some
disadvantages of that (see
http://www.postgresql.org/docs/9.0/static/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY).

The only disadvantages I see there are a) the leftover "invalid" index -
which feels like a
bug; why is it not auto-dropped? and b) the second scan, which implies
more total work
and a wish to background that portion after completing the query triggering
the auto-index.

Dropping an index automatically is a really bad idea. Actually dropping
indexes in general (no matter if the index was created automatically or
manually) is a very tricky thing.

How do you decide which indexes are really "leftover" from those that
are used only rarely? I've seen application failing spectacularily
because a DBA dropped an index that was not used ... except for a batch
process that runs once a year, to close a fiscal year.

Don't forget I suggested doing this only for a limited time after DB
creation. Possibly another
reason to turn it off should be any manual index creation, as that
implies that the DBA
knows about indexing.

I see in another thread you suggest merely placing hints in the log.
That's a fine
first step - but I'll then be wanting to auto-parse that log to
auto-create....

Well, I don't think that is going to happen. I'm not going to implement
this (I've explained my opinions on this above), and I don't think this
would get into core anyway.

Logging? Probably. Advisors? Maybe, although as a contrib module.
Creating indexes automatically? I doubt that.

regards
Tomas

#13Jeremy Harris
jgh@wizmail.org
In reply to: Tomas Vondra (#12)
Re: Can the query planner create indexes?

On 2010-12-21 18:50, Tomas Vondra wrote:

Then the index you just built gets automatically dropped, as I said above.

I'm a bit confused. Should the indexes be dropped automatically (as you
state here) or kept for the future. Because if they should be dropped,
then it does not make sense to do this magic just for a limited time
after the DB goes live.

Here what I said:
"track those that actually get re-used and remove the rest".

Which part is confusing?

- Jeremy

#14Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Jeremy Harris (#13)
Re: Can the query planner create indexes?

Dne 21.12.2010 20:03, Jeremy Harris napsal(a):

On 2010-12-21 18:50, Tomas Vondra wrote:

Then the index you just built gets automatically dropped, as I said
above.

I'm a bit confused. Should the indexes be dropped automatically (as you
state here) or kept for the future. Because if they should be dropped,
then it does not make sense to do this magic just for a limited time
after the DB goes live.

Here what I said:
"track those that actually get re-used and remove the rest".

Which part is confusing?

As I described, identifying which indexes are actually used is a very
tricky task. And it's not difficult to come up with scenarios where this
causes significantly more harm than good.

Basically the time to keep the indices needs to be long enough that the
indexes that are actually used are not dropped (and the resources spent
creating them actually pays off). But on the other side it needs to be
short so that resources are not wasted because of unused indices.

Which are clearly contradictory requirements.

And the 'limits' you've proposed make that even worse, because when the
unnecessary indices get created first and take most of the resources
(e.g. disk space), then the indexes that are actually needed won't be
created because of those limits.

regards
Tomas