Optimizer fails?

Started by Michal Mosiewicz · almost 28 years ago · 9 messages
#1 Michal Mosiewicz
mimo@interdata.com.pl

xxxx=> \d logdt

Table    = logdt
+----------------------------------+----------------------------------+-------+
|              Field               |              Type                | Length|
+----------------------------------+----------------------------------+-------+
| dt                               | timestamp                        |     4 |
+----------------------------------+----------------------------------+-------+
xxxx=> explain select * from log where dt < '19980203';
NOTICE:  QUERY PLAN:

Seq Scan on log (cost=105832.15 size=699588 width=62)

There is an index on the log table, on the dt field. The index is a b-tree.
However, it doesn't seem to be used. (Of course I have vacuumed the table.)
The table has about 2M records. I don't think a Seq Scan is a good idea.

Also, what if I aggregate on the dt field to count(*) or sum some values?
It would be sequentially scanned, then sorted, then grouped and finally
aggregated, right?

Well, sometimes that may be good enough. But if the table is big enough,
it would be wiser to use the index to select ranges from the table and then
aggregate those values without sorting.
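
To make that concrete, here is roughly what I mean (just a sketch: the index
name and the summed column are placeholders, not my real schema):

create index log_dt_idx on log (dt);    -- roughly how the index was built

select dt, count(*), sum(bytes)         -- "bytes" is only an example column
from log
where dt >= '19980201' and dt < '19980203'
group by dt;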

Once I saw index-based aggregates in the TODO list, but somehow the item
disappeared.

Regards,
Mike

--
WWW: http://www.lodz.pdi.net/~mimo tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz * Bugaj 66 m.54 * 95-200 Pabianice * POLAND

#2 Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Michal Mosiewicz (#1)
Re: [HACKERS] Optimizer fails?

> Seq Scan on log (cost=105832.15 size=699588 width=62)
> 
> There is an index on the log table, on the dt field. The index is a b-tree.
> However, it doesn't seem to be used. (Of course I have vacuumed the table.)
> The table has about 2M records. I don't think a Seq Scan is a good idea.
> 
> Also, what if I aggregate on the dt field to count(*) or sum some values?
> It would be sequentially scanned, then sorted, then grouped and finally
> aggregated, right?

I assume you have run vacuum analyze too? It may help.
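
Something like this, using your table name, just to make sure the optimizer
has fresh statistics before you re-check the plan:

vacuum analyze log;
explain select * from log where dt < '19980203';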

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)
#3 David Gould
dg@illustra.com
In reply to: Michal Mosiewicz (#1)
Re: [HACKERS] Optimizer fails?

Michal Mosiewicz asks:

> xxxx=> \d logdt
> 
> Table    = logdt
> +----------------------------------+----------------------------------+-------+
> |              Field               |              Type                | Length|
> +----------------------------------+----------------------------------+-------+
> | dt                               | timestamp                        |     4 |
> +----------------------------------+----------------------------------+-------+
> xxxx=> explain select * from log where dt < '19980203';
> NOTICE:  QUERY PLAN:
> 
> Seq Scan on log (cost=105832.15 size=699588 width=62)
> 
> There is an index on the log table, on the dt field. The index is a b-tree.
> However, it doesn't seem to be used. (Of course I have vacuumed the table.)
> The table has about 2M records. I don't think a Seq Scan is a good idea.

This is almost certainly correct optimizer behavior, unless there are very few
rows that would be qualified by "< '19980203'". Do you know how many rows
were returned?

The key to all this is that scanning sequentially is very cheap. Disks like
it, the OS filesystem is optimized for it, striping across multiple devices
is a big win... It is easy to scan 10 MB per second on PC-class hardware.

Using an index means using random I/O for each row. Unless the target table
fits in the buffer cache this means you are limited to examining a few
dozen rows per second (good disks can sustain about 80 to 100 random IOs
per second).

Back of the envelope calculation:

Assume

rowsize = 100 bytes.
Sequential I/O rate = 1 MB/second.
Random I/O rate = 50 IOs/second.

Then

2M rows @ 10,000 rows per MB = 200 MB, which at 1 MB/second = 200 seconds.

200 seconds * 50 IOs per second = 10,000 IOs

So

If less than 10,000 rows (0.5 % of the table) would qualify, the index
scan might be faster. Otherwise the table scan is optimal.

This calculation ignores the overhead of actually scanning the index and
probably underestimates the sequential I/O rate, so in real life the table
scan may be even better than this.
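
If you want to redo the arithmetic with your own numbers, it is just this
(a throwaway query; the constants are only the assumptions listed above):

select 2000000 / 10000              as table_mb,       -- 200 MB at 100 bytes/row
       (2000000 / 10000) / 1        as seq_scan_sec,   -- 200 seconds at 1 MB/second
       ((2000000 / 10000) / 1) * 50 as breakeven_rows; -- 10,000 rows, i.e. 0.5%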

> Also, what if I aggregate on the dt field to count(*) or sum some values?
> It would be sequentially scanned, then sorted, then grouped and finally
> aggregated, right?
> 
> Well, sometimes that may be good enough. But if the table is big enough,
> it would be wiser to use the index to select ranges from the table and then
> aggregate those values without sorting.

Assuming a table and query like:

create table foo (d date, m money);   -- 2M rows of this, index on d.

select d, sum(m) from foo
group by d;

I would be very surprised if there existed a better plan than to sort the
table and then scan the sorted table (once) to do the grouping and
aggregation.

If you knew the number of 'd' values was smallish, you might try scanning
the table and building a hash containing the d values and aggregates.

I don't know if postgreSQL has this strategy. It is fairly uncommon to
actually find it implemented in real systems.

> Once I saw index-based aggregates in the TODO list, but somehow the item
> disappeared.

Do you mean using covering indexes? This would be very worthwhile, but might
be a bit of a job.

-dg

David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
- Linux. Not because it is free. Because it is better.

#4 David Gould
dg@illustra.com
In reply to: David Gould (#3)
Re: [HACKERS] Optimizer fails?

Michal Mosiewicz writes:

> Say you have a table with 1024 indexed values from 0 to 1023. Now, you
> want to select the values less than, say, 64. If you use a sequential
> scan, you have to scan 1024 records, right? But if you use an index scan,
> you first go to the root node of the tree and compare 64 with the
> value stored in that node, which (for a balanced tree) is 512. Now you
> have to search the next lower node. You compare 64 to 128, which is
> the value of that node. Then you go to the next lower node. You compare
> 64 to 64, and yes, we hit the target. After 3 comparisons, we know that
> everything under this node is the set of records we are looking for,
> right? So, what we have to do is browse through those values and
> collect the tuple identifiers.
> 
> Note that once we have done these 3 comparisons, we have to browse the
> whole structure of the tree below. We are looking for 64 values. So it
> will cost us looking through 128 nodes of the subtree.
> 
> OK, so by using an index, we had to check 128 + 3 nodes of the tree.

Our btree indexes are quite a bit better than the balanced tree you suppose.

> Now, let's note that there have been only a few IO transfers by now. No
> more than a few pages. And we have tuple identifiers pointing us to 64
> records. Now we may sort these TIDs in ascending order to optimise IO.

But we do not do this TID sort. It really isn't easy, as you might have
millions of TIDs, not just a few, which would mean doing an external sort.
This might be a nice thing to do, but it isn't there now as far as I know.

> Everything took us 3 + 128 nodes from the index + 64 records from the table.
> This is definitely better than reading all 1024 records.

I am not trying to flame you, but it seems to me that you have some
misconceptions about how the indexes work and I am trying only to explain
them a little better.

Using your example of 1024 rows with values from 0 to 1023. Further assuming:

- 8K pages
- 200 byte data rows (including overheads)
- 4 byte keys, so perhaps 32 byte index rows with overheads
- btree index

Then:

Table has 8192 / 200 = 40 rows per page giving 1024 / 40 = 26 pages

Index has 8192 / 32 = 256 keys per page giving 1024 / 256 = 4 leaf pages
and one root page with 4 keys.
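
As a quick sanity check on that arithmetic (a throwaway query; the sizes
are only the assumptions above, nothing measured):

select 8192 / 200             as rows_per_page,  -- 40
       (1024 + 40 - 1) / 40   as heap_pages,     -- 26 (rounded up)
       8192 / 32              as keys_per_page,  -- 256
       (1024 + 256 - 1) / 256 as leaf_pages;     -- 4 (rounded up)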

Something like:
                        ____________________
    index root:        | 0, 256, 512, 768   |
                        --------------------
                         /          |            \
                        /           |             \
                _______________     _________________
        leaf0: | 0,1,2...255   |   leaf1: | 256,257...511 |   leaf2: ....
                ---------------     -----------------
                 |     \      \
                 |      \      \---------------------------\
                 |       \                                   \
        data:  | 234 |, | 11 |, | 763 |, | 401 |, | 29 |, | 970 |, | 55 |, ....

Scanning the index to get the TIDs for keys 0...63 will take two page
reads: the root page and leaf0.

But, to access the data rows you still need 64 random IOs.

So the total times to read the data rows for keys 0..63 look like:

Using index:

   IOs        time        why

     1         20msec     read root
     1         20msec     read leaf0
    64       1280msec     read 64 rows from data pages
   ---      ---------
    66       1320msec     total

Using table scan:

   IOs        time        why

     1         20msec     seek and read 1st page
    25        125msec     sequential read (5 msec each) of rest of table
   ---       --------
    26        145msec     total

Note that this ignores cache effects.

If you assume the cache is big enough for the whole table (and in this case
it is), and assume none of the table is resident initially (cold start), then
the IO analysis is:

Using index (with cache):

   IOs        time        why

     1         20msec     read root
     1         20msec     read leaf0
    10        200msec     read 10 unique data pages (assumes the desired keys
                          are not uniformly distributed; this favors the index)
   ---      ---------
    12        240msec     total

Using table scan (with cache):

   IOs        time        why

     1         20msec     seek and read 1st page
    25        125msec     sequential read (5 msec each) of rest of table
   ---       --------
    26        145msec     total (no difference from uncached case)

Even with the very favorable assumption that only part of the data pages
needs to be read, the index scan is still slower here.

> And note that this is a very simple example. In my case I had a 250MB
> table, and about 1% of records to select from it.

My initial calculation a few posts ago was that the break-even was 0.5% for
the table you described, so I still think it is reasonable to do a table scan.

-dg

David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
- Linux. Not because it is free. Because it is better.

#5 Michal Mosiewicz
mimo@interdata.com.pl
In reply to: David Gould (#4)
Re: [HACKERS] Optimizer fails?

David Gould wrote:

> > Now, let's note that there have been only a few IO transfers by now. No
> > more than a few pages. And we have tuple identifiers pointing us to 64
> > records. Now we may sort these TIDs in ascending order to optimise IO.
> 
> But we do not do this TID sort. It really isn't easy, as you might have
> millions of TIDs, not just a few, which would mean doing an external sort.
> This might be a nice thing to do, but it isn't there now as far as I know.

No, you don't need the full set. You may sort it in portions of a
predefined size. You may even try alternating ascending/descending order to
optimise disk head movements, though that may not be a very good idea, since
it works against the read-ahead feature of most disk IO and may negatively
influence ORDER BY performance. Anyhow, you may accomplish sawtooth reads,
which certainly decrease the access time.

> > Everything took us 3 + 128 nodes from the index + 64 records from the table.
> > This is definitely better than reading all 1024 records.
> 
> I am not trying to flame you, but it seems to me that you have some
> misconceptions about how the indexes work and I am trying only to explain
> them a little better.
[cut]
> Using index (with cache):
> 
>    IOs        time        why
> 
>      1         20msec     read root
>      1         20msec     read leaf0
>     10        200msec     read 10 unique data pages (assumes the desired keys
>                           are not uniformly distributed; this favors the index)
>    ---      ---------

OK, this example may be simple. In fact, I would agree that in this case the
figures make a seq scan look sufficient. But this all started from my example
of a 250MB file with about 2M records, from which I wanted to select only
about 25k. (I omit the fact that those records were naturally clustered by
the time they came into the database, so those 20,000 records were pretty
much contiguous.)

As I observed, postgres scanned this table at a rate of 100kBps
(sequentially), much less than the actual I/O throughput of this machine.
Even when I prepared a condition that returned no records, it still scanned
sequentially, while with the index it would cost only about 20msec.

Anyhow... I have to admit that a similar question asked of MySQL takes...

mysql> select count(*) from log where dt < 19980209000000 and
dt > 19980208000000;
+----------+
| count(*) |
+----------+
|    26707 |
+----------+
1 row in set (7.61 sec)

Of course, if I ask it without the index it takes ~3 minutes. That's why
I expected that postgres would make some use of the index. (The table is
the same in both cases.)
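
For reference, the corresponding query on the postgres side looks roughly
like this (same table, timestamp literals in the format I used before);
this is what I expected to hit the index:

explain select count(*) from log
where dt > '19980208' and dt < '19980209';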

Mike

--
WWW: http://www.lodz.pdi.net/~mimo tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz * Bugaj 66 m.54 * 95-200 Pabianice * POLAND

#6 David Gould
dg@illustra.com
In reply to: Michal Mosiewicz (#5)
Re: [HACKERS] Optimizer fails?

Michal Mosiewicz:

> As I observed, postgres scanned this table at a rate of 100kBps
> (sequentially), much less than the actual I/O throughput of this machine.
> Even when I prepared a condition that returned no records, it still scanned
> sequentially, while with the index it would cost only about 20msec.

Well, now it looks like there is a bug or two:

- 100kBps (sequential) is way too slow. If you have time, try profiling
  (with gprof) this scan. We should be able to do much better than this.
  If you can't do it, we might want to put "Improve sequential scan rate"
  on the todo list.

- a "select count(*) from x where <some_index_col> <some_qual>"
  should use the index (see the sketch below).
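
For example (a sketch using Michal's table from earlier in the thread), once
that works the plan for a query like this should show an index scan rather
than a seq scan:

explain select count(*) from log
where dt > '19980208' and dt < '19980209';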

> Anyhow... I have to admit that a similar question asked of MySQL takes...
> 
> mysql> select count(*) from log where dt < 19980209000000 and
> dt > 19980208000000;
> +----------+
> | count(*) |
> +----------+
> |    26707 |
> +----------+
> 1 row in set (7.61 sec)
> 
> Of course, if I ask it without the index it takes ~3 minutes. That's why
> I expected that postgres would make some use of the index. (The table is
> the same in both cases.)

Just out of curiosity, how long do these queries take in MySQL vs postgreSQL?

Thanks
-dg

David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
- Linux. Not because it is free. Because it is better.

#7 Michal Mosiewicz
mimo@interdata.com.pl
In reply to: David Gould (#6)
Re: [HACKERS] Optimizer fails?

David Gould wrote:

> - 100kBps (sequential) is way too slow. If you have time, try profiling
>   (with gprof) this scan. We should be able to do much better than this.
>   If you can't do it, we might want to put "Improve sequential scan rate"
>   on the todo list.

OK. I'll profile it as soon as I find a spare moment.

> > Of course, if I ask it without the index it takes ~3 minutes. That's why
> > I expected that postgres would make some use of the index. (The table is
> > the same in both cases.)
> 
> Just out of curiosity, how long do these queries take in MySQL vs postgreSQL?

MySQL seems to be approximately 20 times better. Using a sequential scan, I
can observe on an IO monitor that it's able to read 2-5MB/s. (This is a
two-disk RAID0 configuration with a maximum throughput of 10MBps.)

Of course, you have to remember that MySQL has much simpler storage due to
the lack of some features like transactions. I've read some of Monty's (the
author's) thoughts on transactions. He says that introducing transactions
would lower performance at least 4-5 times (even read performance).

Now, I've downloaded the personal version of Solid, which I'm also going to
compare. What I personally find very interesting in Solid is its optimistic
locking; actually, it means no locking at all. Solid seems to have a
non-overwriting feature much like postgres, but this non-overwriting feature
is limited to not-yet-checkpointed data in buffers (logs). Also, it maintains
an internal structure called a 'Bonsai Tree' that records all transactions
with their times. If there is a collision between data, it is detected from
this Bonsai Tree structure, and then the latest transaction is rolled back.

By using systematic checkpointing it makes sure that those structures stay
relatively small and conflicts are resolved quickly.

Of course, don't treat this as a comparison between postgres and a
commercial product. I just want to share it as a kind of 'food for
thought'.

Mike

--
WWW: http://www.lodz.pdi.net/~mimo tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz * Bugaj 66 m.54 * 95-200 Pabianice * POLAND

#8 Vadim B. Mikheev
vadim@sable.krasnoyarsk.su
In reply to: David Gould (#4)
Re: [HACKERS] Optimizer fails?

David Gould wrote:

> > Now, let's note that there have been only a few IO transfers by now. No
> > more than a few pages. And we have tuple identifiers pointing us to 64
> > records. Now we may sort these TIDs in ascending order to optimise IO.
> 
> But we do not do this TID sort. It really isn't easy, as you might have
> millions of TIDs, not just a few, which would mean doing an external sort.
> This might be a nice thing to do, but it isn't there now as far as I know.

Using the TID as the (last) part of the index key is on my TODO list.
This will speed up vacuuming, get rid of all the duplicate-key problems,
and give us the feature above.

> Scanning the index to get the TIDs for keys 0...63 will take two page
> reads: the root page and leaf0.

+ a metapage read first, to get the root page block number.

Vadim

#9 Zeugswetter Andreas SARZ
Andreas.Zeugswetter@telecom.at
In reply to: Vadim B. Mikheev (#8)
AW: [HACKERS] Optimizer fails?

There is one comment I would like to make on the issue of a sequential scan
being faster than an index scan. In a single-user system it is indeed often
true that an index scan is more expensive than a sequential scan. As long as
we have table-level locks, this is also true for heavily concurrent access.

Here comes the disadvantage: once row or page locks are implemented, the
sequential scan cost should be reconsidered, since readers will then often
be waiting for updaters that are actually updating data which is irrelevant
to the reader. The average wait time will have to be added to the sequential
scan cost.

Andreas