Re: order by and index path

Started by Andreas Zeugswetterover 27 years ago9 messageshackers
Jump to latest
#1Andreas Zeugswetter
andreas.zeugswetter@telecom.at

Jan wrote:

If there is an ORDER BY clause, using an index scan is the
clever way if the indexqual dramatically reduces the the
amount of data selected and sorted. I think this is the
normal case

yes

(who really selects nearly all rows from a 5M row
table?).

Data Warehouse apps

This will hurt if someone really selects most of the rows and the index
scan jumps over the disc.

I think this is a non issue, since if a qual is not restrictive enough,
the optimizer should choose a seq scan anyway. Doesn' t it do this already ?

But here the programmer should use
an unqualified query to perform a seqscan and do the
qualification in the frontend application.

I would reformulate this to:
Here the backend should do a seq scan and use the qualification to eliminate
not wanted rows.

Resumee:
You have to look at this from the cost point of view. If there is an order by that can be
done with an index, this will make the index a little more preferrable than for the same
query without the order by, but it should not force the index.
You have to give the sort a cost, so that the index access can be compared to the
seq scan and sort path.

Andreas

#2Jan Wieck
JanWieck@Yahoo.com
In reply to: Andreas Zeugswetter (#1)

Andread wrote:

(who really selects nearly all rows from a 5M row
table?).

Data Warehouse apps

This will hurt if someone really selects most of the rows and the index
scan jumps over the disc.

I think this is a non issue, since if a qual is not restrictive enough,
the optimizer should choose a seq scan anyway. Doesn' t it do this already ?
[...]

Right and wrong.

Right - it is the optimizers job to decide if an index should
be used or not. And the decision has to be made based on the
cost.

Wrong - PostgreSQL's query optimizer doesn't do it already.
It assumes that a qualification is always restrictive enough
and chooses an index scan any time if the qualification can
be thrown into the indexqual.

In the following I only discuss the situation where
qualifications can be used in the indexqual.

Calculating the cost of a query is easy. Have N tuples in P
data-pages where the given qualification will match M.
Assuming that the tuples are not in ascending order in the
data pages, the cost fetching one tuple by its TID raises
with P (more seeking necessary). Now you can calculate the
cost of an index scan by C=M/N*P*F where F is some constant
factor to make C comparable to a current seqscan cost value
(I know, it must be smarter, but for this description a
simple calculation is enough).

The only problem is that the optimizer has absolutely no
chance to estimate M (the mystic value as Bruce called it).
In a given qualification

WHERE key > 0 AND key <= 100

it cannot know if this would result in 0 or 100% of the rows.
To estimate that, it needs statistical information about the
key ranges that are present. Assume there would be 11 keys
remembered by the last vacuum run, that break up the whole
present key range of 10000 tuples into 10 chunks and they
read

5 40 70 90 500 600 1000 1100 1400 1500 2000

where 5 is the lowest key at all, 40 is the key of tuple 1000
(in key order), 70 is the key of tuple 2000 and so on. Now
looking at the qualification and this key range information
would tell, that the absolute limit of rows returned by an
index scan would be 3999 (which still could have a key value
of 100). But the qualification

WHERE key >= 50

could return at max 8999 tuples and

WHERE key > 50 AND key < 70

has a maximum of 998 result tuples. This would be the
information required to make the right decision for the case
where all rows selected are wanted.

We do not have this statistical information. So the whole
thing is at this time academic.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#======================================== jwieck@debis.com (Jan Wieck) #

#3Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Jan Wieck (#2)
Re: [HACKERS] Re: order by and index path

We do not have this statistical information. So the whole
thing is at this time academic.

I recall that Commercial Ingres made the assumption that one row (or 1%
of rows? My memory of Ingres is fading :) would be returned from a
qualified query if no statistics were available to suggest otherwise.

It did collect statistics on data distribution to try to help make those
optimizer choices.

It may be reasonable to assume that if there is an index, then using it
with any qualified query would be a win. Since the alternative is to
decide to _not_ use an index, a decision for which we have no support
with existing statistics.

- Tom

#4Jan Wieck
JanWieck@Yahoo.com
In reply to: Thomas Lockhart (#3)
Statistics on key distribution (was: Re: order by and index path)

We do not have this statistical information. So the whole
thing is at this time academic.

I recall that Commercial Ingres made the assumption that one row (or 1%
of rows? My memory of Ingres is fading :) would be returned from a
qualified query if no statistics were available to suggest otherwise.

It did collect statistics on data distribution to try to help make those
optimizer choices.

It may be reasonable to assume that if there is an index, then using it
with any qualified query would be a win. Since the alternative is to
decide to _not_ use an index, a decision for which we have no support
with existing statistics.

It may be also reasonable to collect statistic information
and use that to quantify the cost of an index scan.

The vacuum cleaner scans all indices on a relation vacuum'd
completely. And at that time it already knows the number of
pages and tuples in the heap relation (has that in the
vcrelstats).

Based on this it could decide to take every n'th index tuple
while scanning and drop them somewhere where other backends
can find them. This would be the statistical information
needed by the optimizer to estimate the real cost of an index
scan. It is only of interest for big tables, where hopping
from block to block will make an index scan a looser against
a seqscan in a many row matching scan. So it's up to the
optimizer do decide based on the # of pages if statistical
information is really required for cost calculation.

Having the final indexqual along with the statistical
information it will be a little tricky to figure out how many
rows it might return, but not impossible.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#======================================== jwieck@debis.com (Jan Wieck) #

#5Bruce Momjian
bruce@momjian.us
In reply to: Andreas Zeugswetter (#1)
Re: [HACKERS] Re: order by and index path

Jan wrote:

If there is an ORDER BY clause, using an index scan is the
clever way if the indexqual dramatically reduces the the
amount of data selected and sorted. I think this is the
normal case

yes

(who really selects nearly all rows from a 5M row
table?).

Data Warehouse apps

This will hurt if someone really selects most of the rows and the index
scan jumps over the disc.

I think this is a non issue, since if a qual is not restrictive enough,
the optimizer should choose a seq scan anyway. Doesn' t it do this already ?

Yes it does.

But here the programmer should use
an unqualified query to perform a seqscan and do the
qualification in the frontend application.

I would reformulate this to:
Here the backend should do a seq scan and use the qualification to eliminate
not wanted rows.

Resumee:
You have to look at this from the cost point of view. If there is an order by that can be
done with an index, this will make the index a little more preferrable than for the same
query without the order by, but it should not force the index.
You have to give the sort a cost, so that the index access can be compared to the
seq scan and sort path.

This cost is compared. The optimizer uses the min-max values for the
column stored in pg_statistic to see how much of the table is being
requested, and decides on an index or not.

Doing the restriction on the fontend sounds kind of cheesy to me.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#6Bruce Momjian
bruce@momjian.us
In reply to: Jan Wieck (#2)
Re: [HACKERS] Re: order by and index path

could return at max 8999 tuples and

WHERE key > 50 AND key < 70

has a maximum of 998 result tuples. This would be the
information required to make the right decision for the case
where all rows selected are wanted.

We do not have this statistical information. So the whole
thing is at this time academic.

But we do have statistical information in pg_statistic if you run vacuum
analyze.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#7Bruce Momjian
bruce@momjian.us
In reply to: Thomas Lockhart (#3)
Re: [HACKERS] Re: order by and index path

We do not have this statistical information. So the whole
thing is at this time academic.

I recall that Commercial Ingres made the assumption that one row (or 1%
of rows? My memory of Ingres is fading :) would be returned from a
qualified query if no statistics were available to suggest otherwise.

It did collect statistics on data distribution to try to help make those
optimizer choices.

It may be reasonable to assume that if there is an index, then using it
with any qualified query would be a win. Since the alternative is to
decide to _not_ use an index, a decision for which we have no support
with existing statistics.

For =, the assumion is 1 row, for > the assumption is 1/3 of the table.
With pg_statistic, it uses that.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#8Marty Scholes
marty@outputservices.com
In reply to: Bruce Momjian (#7)
Re: [HACKERS] Re: order by and index path

(who really selects nearly all rows from a 5M row
table?).

I have to comment here from a practical standpoint. My company does the above
dozens of times per day using Oracle 7 under Solaris. This is a *VERY*
common operation in data processing organizations (like mine) running large
batch jobs in tight time windows. At least under Oracle 7 (not sure about
ProgreSQL, still running tests), it is impractical to sort an extraction
using ORDER BY, but *MUCH* cheaper to index the fields wanted and then
artificially constrain the query
to convince the oprimizer to use the index for extraction. You can
get the whole dataset in any order you want with few resources. Again, we
do it all the time.

Marty

#9Jan Wieck
JanWieck@Yahoo.com
In reply to: Bruce Momjian (#6)
Re: [HACKERS] Re: order by and index path

could return at max 8999 tuples and

WHERE key > 50 AND key < 70

has a maximum of 998 result tuples. This would be the
information required to make the right decision for the case
where all rows selected are wanted.

We do not have this statistical information. So the whole
thing is at this time academic.

But we do have statistical information in pg_statistic if you run vacuum
analyze.

Nice (forgot that - pardon), anyway only having lowest and
highest key values isn't enough to make a useful estimation
about how many rows an indexqual will return. If we change
pg_statistic in a way that more keys can get stored per
relation/attribute, then the optimizer would have a real
chance on it.

I have

starelid
staattnum
staitupno
staop
stakey

in mind, where staitupno tells the position of the key in a
complete index scan. Then it becomes the place to fill in the
key range information as described in my posting.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#======================================== jwieck@debis.com (Jan Wieck) #