Index of a table is not used in every case
Hello PostgreSQL Users!
PostgreSQL V 7.1.1:
I have defined a table and the necessary indices.
But the index is not used in every SELECT. (Therefore, those selects are
*very* slow, due to a seq scan over 20 million entries; this is only a
test setup so far.)
The definitions can be seen in the annex.
Does somebody know the reason and how to circumvent the seq scan?
Is the order of index creation relevant? I.e., should I create the
indices before inserting entries, or the other way around?
Should a hashing index be used? (I tried this, but I got the known error
"Out of overflow pages")
(The documentation on "create index" says:
"Notes
The Postgres query optimizer will consider using a btree index
whenever an indexed attribute is involved in a
comparison using one of: <, <=, =, >=, >
The Postgres query optimizer will consider using an rtree index
whenever an indexed attribute is involved in a
comparison using one of: <<, &<, &>, >>, @, ~=, &&
The Postgres query optimizer will consider using a hash index
whenever an indexed attribute is involved in a
comparison using the = operator.")
The table column 'epoche' is used in two different indices. Should that
be avoided?
Any suggestions are welcome.
Thank you in advance.
Reiner
------------------------------
Annex:
======
Table:
------
\d wetter
Table "wetter"
Attribute | Type | Modifier
-----------+--------------------------+----------
sensor_id | integer | not null
epoche | timestamp with time zone | not null
wert | real | not null
Indices: wetter_epoche_idx,
wetter_pkey
\d wetter_epoche_idx
Index "wetter_epoche_idx"
Attribute | Type
-----------+--------------------------
epoche | timestamp with time zone
btree
\d wetter_pkey
Index "wetter_pkey"
Attribute | Type
-----------+--------------------------
sensor_id | integer
epoche | timestamp with time zone
unique btree (primary key)
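For reference, DDL roughly equivalent to these definitions (reconstructed
from the \d output above, not the original statements):
create table wetter (
    sensor_id integer not null,
    epoche    timestamp with time zone not null,
    wert      real not null,
    primary key (sensor_id, epoche)   -- becomes the unique btree index wetter_pkey
);
create index wetter_epoche_idx on wetter (epoche);   -- btree index on epoche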
Select where index is used:
============================
explain select * from wetter order by epoche desc;
NOTICE: QUERY PLAN:
Index Scan Backward using wetter_epoche_idx on wetter
(cost=0.00..3216018.59 rows=20340000 width=16)
EXPLAIN
Select where the index is NOT used:
===================================
explain select * from wetter where epoche between '1970-01-01' and
'1980-01-01' order by epoche asc;
NOTICE: QUERY PLAN:
Sort (cost=480705.74..480705.74 rows=203400 width=16)
-> Seq Scan on wetter (cost=0.00..454852.00 rows=203400 width=16)
EXPLAIN
--
Mit freundlichen Gruessen / With best regards
Reiner Dassing
Reiner Dassing <dassing@wettzell.ifag.de> writes:
Hello PostgreSQL Users!
PostgreSQL V 7.1.1:
I have defined a table and the necessary indices.
But the index is not used in every SELECT. (Therefore, the selects are
*very* slow, due to seq scan on
20 million entries, which is a test setup up to now)
Perennial first question: did you VACUUM ANALYZE?
-Doug
--
Let us cross over the river, and rest under the shade of the trees.
--T. J. Jackson, 1863
Hello PostgreSQL Users!
PostgreSQL V 7.1.1:
You should upgrade to 7.1.3 at some point...
I have defined a table and the necessary indices.
But the index is not used in every SELECT. (Therefore, the selects are
*very* slow, due to seq scan on
20 million entries, which is a test setup up to now.)
The definitions can be seen in the annex.
Does some body know the reason and how to circumvent the seq scan?
Yes. You probably have not run 'VACUUM ANALYZE' on your large table.
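Something along these lines (table name taken from your annex) should give
the planner real statistics to work with:
vacuum analyze wetter;   -- collects the statistics the planner needs
explain select * from wetter
 where epoche between '1970-01-01' and '1980-01-01'
 order by epoche asc;    -- then re-check which plan is chosen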
Is the order of index creation relevant? I.e., should I create the
indices before inserting
entries or the other way around?
If you are inserting a great many entries, insert the data first and then
create the indices - it will be much faster this way.
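Roughly like this, as a sketch (the file path and load method are placeholders):
drop index wetter_epoche_idx;                        -- drop the secondary index before the bulk load
copy wetter from '/tmp/wetter.dat';                  -- or your stream of INSERTs
create index wetter_epoche_idx on wetter (epoche);   -- rebuild the index afterwards
vacuum analyze wetter;                               -- and refresh the planner statistics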
Should a hashing index be used? (I tried this, but I got the known error
"Out of overflow pages")
Just do the default CREATE INDEX - btree should be fine... (probably)
The table entry 'epoche' is used in two different indices. Should that
be avoided?
It's not a problem, but do check your EXPLAIN output after the VACUUM to
make sure you have them right.
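For example (illustrative queries against the columns from your annex):
explain select * from wetter
 where sensor_id = 42 and epoche = '1975-06-01';      -- should be able to use wetter_pkey
explain select * from wetter
 where epoche between '1975-01-01' and '1975-02-01';  -- should be able to use wetter_epoche_idx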
Chris
On Mon, 22 Oct 2001, Reiner Dassing wrote:
Hello PostgreSQL Users!
PostgreSQL V 7.1.1:
I have defined a table and the necessary indices.
But the index is not used in every SELECT. (Therefore, the selects are
*very* slow, due to seq scan on
20 million entries, which is a test setup up to now.)
The definitions can be seen in the annex.
Does some body know the reason and how to circumvent the seq scan?
Is the order of index creation relevant? I.e., should I create the
indices before inserting
entries or the other way around?
Have you run a vacuum analyze to update the statistics after the data was
loaded?
Reiner,
For future notice, [SQL] is the correct list for this kind of inquiry.
Please do not post it to [HACKERS]. And please don't cross-post ... it
results in a lot of needless duplication of effort.
I have defined a table and the necessary indices.
Is the order of index creation relevant? I.e., should I create the
indices before inserting
entries or the other way around?
Ummm ... not to be obvious, or anything, but did you VACUUM ANALYZE
after populating your table?
There are also some special steps to take if you are regularly deleting
large numbers of records.
-Josh
______AGLIO DATABASE SOLUTIONS___________________________
Josh Berkus
Complete information technology josh@agliodbs.com
and data management solutions (415) 565-7293
for law firms, small businesses fax 621-2533
and non-profit organizations. San Francisco
Hello all!
Thank you for the answers I got.
First, I would like to mention that I will use the [SQL] list for my
replies, as Josh Berkus requested.
Q: "did you use VACUUM ANALYZE"?
A: This table was a test bed, just using INSERTS without ANY deletes or
updates
(See: vacuum verbose analyze wetter;
NOTICE: --Relation wetter--
NOTICE: Pages 149752: Changed 0, reaped 194, Empty 0, New 0;
Tup 20340000: Vac 26169, Keep/VTL 0/0, Crash 0, UnUsed 0,
MinLen 52,
MaxLen 52;
Re-using: Free/Avail. Space 1467792/1467792;
EndEmpty/Avail. Pages 0/194. CPU 6.10s/1.78u sec.
)
Q: You should upgrade to 7.1.3?
A: Can you tell me the specific reason?
I am afraid that the real answer has not been mentioned yet:
Why is the index used in
select * from wetter order by epoche desc;
but not in
select * from wetter where epoche between '1970-01-01' and '1980-01-01'
order by epoche asc;
?
Any ideas?
--
Mit freundlichen Gruessen / With best regards
Reiner Dassing
Reiner Dassing <dassing@wettzell.ifag.de> writes:
I would like to mention first, that I will use the [SQL] list for my
answers,
regarding the notice of Josh Berkus.
Q: "did you use VACUUM ANALYZE"?
A: This table was a test bed, just using INSERTS without ANY deletes or
updates
You still need to run VACUUM ANALYZE. The ANALYZE part measures the
statistics of your data, which the planner needs in order to make
decisions.
Am afraid, that the real answer is not mentioned:
Why is the index used in the SELECT:
select * from wetter order by epoche desc;
select * from wetter where epoche between '1970-01-01' and '1980-01-01'
order by epoche asc;
If you post the EXPLAIN output for these queries, someone can probably
help you.
-Doug
--
Let us cross over the river, and rest under the shade of the trees.
--T. J. Jackson, 1863
In article <web-490372@davinci.ethosmedia.com>, Josh Berkus wrote:
Reinier,
For future notice, [SQL] is the correct list for this kind of inquiry.
Please do not post it to [HACKERS]. And please don't cross-post ... it
results in a lot of needless duplication of effort.
I have defined a table and the necessary indices.
Is the order of index creation relevant? I.e., should I create the
indices before inserting
entries or the other way around?
Ummm ... not to be obvious, or anything, but did you VACUUM ANALYZE
after populating your table?
There are also some special steps to take if you are regularly deleting
large numbers of records.
Could you tell me what those steps are or where to find them? I have
a db that I delete about 1 million records a day from in a batch job.
The only special thing I do is reindex the table involved every few days,
to reclaim the space burned because the indexes do not reclaim space when
rows are deleted. What other good and useful things could I do?
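The current routine looks roughly like this (table name and retention
criterion are placeholders, not the real ones):
delete from history where stamp < current_date - 30;   -- nightly batch delete (placeholder names)
vacuum analyze history;                                 -- reclaim heap space and refresh statistics
reindex table history;                                  -- every few days: rebuild indexes, since
                                                        -- deleting rows does not shrink them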
Thanks
marc
-Josh
______AGLIO DATABASE SOLUTIONS___________________________
Josh Berkus
Complete information technology josh@agliodbs.com
and data management solutions (415) 565-7293
for law firms, small businesses fax 621-2533
and non-profit organizations. San Francisco
Doug McNaught wrote:
Reiner Dassing <dassing@wettzell.ifag.de> writes:
Hello PostgreSQL Users!
PostgreSQL V 7.1.1:
I have defined a table and the necessary indices.
But the index is not used in every SELECT. (Therefore, the selects are
*very* slow, due to seq scan on
20 million entries, which is a test setup up to now.)
Perennial first question: did you VACUUM ANALYZE?
Can there, or could there, be a notion of "rule based" optimization of
queries in PostgreSQL? The "not using index" problem is probably the most
common and most misunderstood problem.
Reiner Dassing <dassing@wettzell.ifag.de> writes:
explain select * from wetter order by epoche desc;
NOTICE: QUERY PLAN:
Index Scan Backward using wetter_epoche_idx on wetter
(cost=0.00..3216018.59 rows=20340000 width=16)
explain select * from wetter where epoche between '1970-01-01' and
'1980-01-01' order by epoche asc;
NOTICE: QUERY PLAN:
Sort (cost=480705.74..480705.74 rows=203400 width=16)
-> Seq Scan on wetter (cost=0.00..454852.00 rows=203400 width=16)
It's hard to believe that you've done a VACUUM ANALYZE on this table,
since you are getting a selectivity estimate of exactly 0.01, which
just happens to be the default selectivity estimate for range queries.
How many rows are there really in this date range?
Anyway, the reason the planner is picking a seqscan+sort is that it
thinks that will be faster than an indexscan. It's not necessarily
wrong. Have you compared the explain output and actual timings both
ways? (Use "set enable_seqscan to off" to force it to pick an indexscan
for testing purposes.)
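That is, something like:
set enable_seqscan to off;   -- for testing only: discourage seq scans
explain select * from wetter
 where epoche between '1970-01-01' and '1980-01-01'
 order by epoche asc;
set enable_seqscan to on;    -- restore the default afterwards
-- and compare actual run times of both variants, not just the estimated costs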
regards, tom lane
*very* slow, due to seq scan on
20 million entries, which is a test setup up to now.)
Perennial first question: did you VACUUM ANALYZE?
Can there, or could there, be a notion of "rule based" optimization of
queries in PostgreSQL? The "not using index" problem is probably the most
common and most misunderstood problem.
There is a (sort of) rule based behavior in PostgreSQL.
The downside of the current implementation is that commands
other than ANALYZE (e.g. "create index") partly update
optimizer statistics. This is bad behavior, since then only part
of the statistics is accurate. Statistics always have to be seen
in the context of other tables' and other indexes' statistics.
Thus, currently the rule based optimizer only works if you create
the indexes on empty tables (before loading data), which obviously
has downsides. Else you have no choice but to ANALYZE frequently.
I have tried hard to fight for this pseudo rule based behavior,
but was only partly successful in convincing core. My opinion is
that (unless runtime statistics are kept) no command other than
ANALYZE should be allowed to touch optimizer relevant statistics
(maybe unless explicitly told to).
Andreas
Zeugswetter Andreas SB SD wrote:
*very* slow, due to seq scan on
20 million entries, which is a test setup up to now.)
Perennial first question: did you VACUUM ANALYZE?
Can there, or could there, be a notion of "rule based" optimization of
queries in PostgreSQL? The "not using index" problem is probably the most
common and most misunderstood problem.
There is a (sort of) rule based behavior in PostgreSQL,
the down side of the current implementation is, that certain
other commands than ANALYZE (e.g. "create index") partly update
optimizer statistics. This is bad behavior, since then only part
of the statistics are accurate. Statistics always have to be seen
in the context of other tables' and other indexes' statistics.
Thus, currently the rule based optimizer only works if you create
the indexes on empty tables (before loading data), which obviously
has downsides. Else you have no choice but to ANALYZE frequently.
I have tried hard to fight for this pseudo rule based behavior,
but was only partly successful in convincing core. My opinion is,
that (unless runtime statistics are kept) no other command than
ANALYZE should be allowed to touch optimizer relevant statistics
(maybe unless explicitly told to).
Perhaps there could be an extension to ANALYZE, i.e. ANALYZE RULEBASED
tablename, that would restore or recalculate the state a table would be in
if all indexes were created from scratch?
The "not using index" was very frustrating to understand. The stock answer,
"did you vacuum?" just isn't enough. There has to be some explanation (in the
FAQ or something) about the indexed key distribution in your data. Postgres'
statistics are pretty poor too, a relative few very populous entries in a table
will make it virtually impossible for the cost based optimizer (CBO) to use an
index.
At my site we have lots of tables that have many duplicate items in an index.
It is a music based site and has a huge number of "Various Artists" entries. No
matter what we do, there is NO way to get Postgres to use the index from the
query alone. We have over 20 thousand artists, but 5 "Various Artists" or
"Soundtrack" entries change the statistics so much that they exclude an index
scan. We have to run the system with sequential scan disabled. Running with seq
scan disabled eliminates the usefulness of the CBO, because even when a table
scan is justified, it does an index scan.
I have approached this windmill before and am a bit regretful at bringing
it up again, but it is important, very important. There needs to be a way
to direct the optimizer about how to optimize the query.
Using "set foo=bar" prior to a query is not acceptable. Web sites use
persistent connections to the databases and since "set" can not be restored,
you override global settings for the session, or have to code, in the web page,
the proper default setting. The result is either that different web processes
will behave differently depending on the order in which they execute queries,
or you have to have your DBA write web pages.
A syntax like:
select * from table where /* enable_seqscan = false */ key = 'value';
Would be great in that you could tune the optimizer as long as the settings
were for the clause directly following the directive, without affecting the
state of the session or transaction. For instance:
select id from t1, t2 where /* enable_seqscan = false */ t1.key = 'value' and
t2.key = 'test' and t1.id = t2.id;
The where "t1.key = 'value'" condition would be prohibited from using a
sequntial scan, while the "t2.key = 'test'" would use it if it made sense.
Is this possible?
mlw <markw@mohawksoft.com> writes:
... Postgres' statistics are pretty poor too, a relative few very
populous entries in a table will make it virtually impossible for the
cost based optimizer (CBO) to use an index.
Have you looked at development sources lately?
regards, tom lane
mlw writes:
The "not using index" was very frustrating to understand. The stock answer,
"did you vacuum?" just isn't enough. There has to be some explanation (in the
FAQ or something) about the indexed key distribution in your data.
Most "not using index" questions seem to be related to a misunderstanding
of users to the effect that "if there is an index it must be used, not
matter what the query", which is of course far from reality. Add to that
the (related) category of inquiries from people that think the index ought
to be used but don't have any actual timings to show, you have a lot of
people that just need to be educated.
Of course the question "did you vacuum" (better, did you analyze) is
annoying, just as the requirement to analyze is annoying in the first
place, but unless someone designs a better query planner it will have to
do. The reason why we always ask that question first is that people
invariably have not analyzed. A seasoned developer can often tell from
the EXPLAIN output whether ANALYZE has been done, but users cannot.
Perhaps something can be done in this area, but I'm not exactly sure what.
--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter
Of course the question "did you vacuum" (better, did you analyze) is
annoying, just as the requirement to analyze is annoying in the first
place, but unless someone designs a better query planner it
will have to do. The reason why we always ask that question first is
that people invariably have not analyzed.
I think it is also not always useful to ANALYZE. There are applications
that choose optimal plans with only the rudimentary statistics VACUUM
creates, and even some that get optimal plans with only the default
statistics in place.
Imho one of the biggest sources of problems is people creating new
indexes on populated tables when the rest of the db/table has badly
outdated statistics or even only default statistics in place.
In this situation the optimizer is badly misguided, because it now
sees completely inconsistent statistics to work on.
(e.g. old indexes on that table may seem way too cheap compared
to table scan)
I would thus propose a more differentiated approach to writing
the statistics gathered during "create index" to the system tables.
Something like:
    if (default stats in place)
        write defaults
    else if (this is the only index)
        write gathered statistics
    else
        write only normalized statistics for the index
        (e.g. index.reltuples = table.reltuples;
              index.relpages  = index.gathered.relpages *
                                table.relpages / table.gathered.relpages)
Andreas
Hello Tom!
Tom Lane wrote:
Reiner Dassing <dassing@wettzell.ifag.de> writes:
explain select * from wetter order by epoche desc;
NOTICE: QUERY PLAN:
Index Scan Backward using wetter_epoche_idx on wetter
(cost=0.00..3216018.59 rows=20340000 width=16)
explain select * from wetter where epoche between '1970-01-01' and
'1980-01-01' order by epoche asc;
NOTICE: QUERY PLAN:
Sort (cost=480705.74..480705.74 rows=203400 width=16)
-> Seq Scan on wetter (cost=0.00..454852.00 rows=203400 width=16)
It's hard to believe that you've done a VACUUM ANALYZE on this table,
since you are getting a selectivity estimate of exactly 0.01, which
just happens to be the default selectivity estimate for range queries.
How many rows are there really in this date range?
Well, I did not claim that I had done a VACUUM ANALYZE; I just set up a
new table for testing purposes, doing only INSERTs.
After VACUUM ANALYSE the results look like:
explain select * from wetter where epoche between '1970-01-01' and
test_wetter-# '1980-01-01' order by epoche asc;
NOTICE: QUERY PLAN:
Index Scan using wetter_epoche_idx on wetter (cost=0.00..3313780.74
rows=20319660 width=16)
EXPLAIN
Now, the index scan is used and therefore the query is very fast, as
expected.
For me, as a user not involved in all the intrinsics of PostgreSQL, the
question was "Why is this SELECT so slow?" (a question that is asked many
times on these mailing lists).
Now, I would like to say thank you! You have explained to me, and
hopefully to many more users, what is going on behind the scenes.
Anyway, the reason the planner is picking a seqscan+sort is that it
thinks that will be faster than an indexscan. It's not necessarily
wrong. Have you compared the explain output and actual timings both
ways? (Use "set enable_seqscan to off" to force it to pick an indexscan
for testing purposes.)
regards, tom lane
--
Mit freundlichen Gruessen / With best regards
Reiner Dassing
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
Imho one of the biggest sources for problems is people creating new
indexes on populated tables when the rest of the db/table has badly
outdated statistics or even only default statistics in place.
In this situation the optimizer is badly misguided, because it now
sees completely inconsistent statistics to work on.
(e.g. old indexes on that table may seem way too cheap compared
to table scan)
I don't think any of this is correct. We don't have per-index
statistics. The only stats updated by CREATE INDEX are the same ones
updated by plain VACUUM, viz the number-of-tuples and number-of-pages
counts in pg_class. I believe it's reasonable to update those stats
more often than the pg_statistic stats (in fact, if we could keep them
constantly up-to-date at a reasonable cost, we'd do so). The
pg_statistic stats are designed as much as possible to be independent
of the absolute number of rows in the table, so that it's okay if they
are out of sync with the pg_class stats.
The major reason why "you vacuumed but you never analyzed" is such a
killer is that in the absence of any pg_statistic data, the default
selectivity estimates are such that you may get either an index or seq
scan depending on how big the table is. The cost estimates are
nonlinear (correctly so, IMHO, though I wouldn't necessarily defend the
exact shape of the curve) and ye olde default 0.01 will give you an
indexscan for a small table but not for a big one. In 7.2 I have
reduced the default selectivity estimate to 0.005, for a number of
reasons but mostly to get it out of the range where the decision will
flip-flop. Observe:
test71=# create table foo (f1 int);
CREATE
test71=# create index fooi on foo(f1);
CREATE
test71=# explain select * from foo where f1 = 42;
NOTICE: QUERY PLAN:
Index Scan using fooi on foo (cost=0.00..8.14 rows=10 width=4)
EXPLAIN
test71=# select reltuples,relpages from pg_class where relname = 'foo';
reltuples | relpages
-----------+----------
1000 | 10
(1 row)
test71=# update pg_class set reltuples = 100000, relpages = 1000 where relname = 'foo';
UPDATE 1
test71=# explain select * from foo where f1 = 42;
NOTICE: QUERY PLAN:
Index Scan using fooi on foo (cost=0.00..1399.04 rows=1000 width=4)
EXPLAIN
test71=# update pg_class set reltuples = 1000000, relpages = 10000 where relname = 'foo';
UPDATE 1
test71=# explain select * from foo where f1 = 42;
NOTICE: QUERY PLAN:
Seq Scan on foo (cost=0.00..22500.00 rows=10000 width=4)
EXPLAIN
test71=#
In current sources you keep getting an indexscan as you increase the
number of tuples...
regards, tom lane
Tom Lane writes:
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
Imho one of the biggest sources for problems is people creating new
indexes on populated tables when the rest of the db/table has badly
outdated statistics or even only default statistics in place.
In this situation the optimizer is badly misguided, because it now
sees completely inconsistent statistics to work on.
(e.g. old indexes on that table may seem way too cheap compared
to table scan)
I don't think any of this is correct. We don't have per-index
statistics. The only stats updated by CREATE INDEX are the same ones
updated by plain VACUUM, viz the number-of-tuples and number-of-pages
counts in pg_class.
1. Have I said anything about stats other than relpages and reltuples?
2. Even the most accurate pg_statistic is of only limited use if reltuples
and relpages are completely off. In the current behavior you e.g. get:
rel1:   pages = 100000  -- updated from "create index"
index1: pages = 2       -- outdated
index2: pages = 2000    -- current
rel2:   pages = 1       -- outdated
--> Optimizer will prefer join order: rel2, rel1
I believe it's reasonable to update those stats
more often than the pg_statistic stats (in fact, if we could keep them
constantly up-to-date at a reasonable cost, we'd do so).
There is a whole lot of difference between keeping them constantly up to
date and modifying (part of) them in the "create index" command, so I do
not counter your above sentence, but imho the conclusion is wrong.
The
pg_statistic stats are designed as much as possible to be independent
of the absolute number of rows in the table, so that it's okay if they
are out of sync with the pg_class stats.
Independently, they can only be good for choosing whether to use an
index or seq scan. They are not sufficient to choose a good join order.
The major reason why "you vacuumed but you never analyzed" is such a
killer is that in the absence of any pg_statistic data, the default
selectivity estimates are such that you may get either an index or seq
scan depending on how big the table is. The cost estimates are
nonlinear (correctly so, IMHO, though I wouldn't necessarily
defend the
exact shape of the curve) and ye olde default 0.01 will give you an
indexscan for a small table but not for a big one. In 7.2 I have
reduced the default selectivity estimate to 0.005, for a number of
reasons but mostly to get it out of the range where the decision will
flip-flop.
Yes, the new selectivity is better, though imho it is still too high.
Imho the strategy should be to assume a good selectivity
of values in the absence of pg_statistic evidence.
If the index was not selective enough for an average query, the
dba should not have created the index in the first place.
test71=# create table foo (f1 int);
test71=# create index fooi on foo(f1);
test71=# explain select * from foo where f1 = 42;
Index Scan using fooi on foo (cost=0.00..8.14 rows=10 width=4)
test71=# update pg_class set reltuples = 100000, relpages =
1000 where relname = 'foo';
Index Scan using fooi on foo (cost=0.00..1399.04 rows=1000 width=4)
test71=# update pg_class set reltuples = 1000000, relpages =
10000 where relname = 'foo';
Seq Scan on foo (cost=0.00..22500.00 rows=10000 width=4)
In current sources you keep getting an indexscan as you increase the
number of tuples...
As you can see, it toppled at 1 million rows :-(
Andreas