About b-tree usage

Started by Ioannis Theoharis · 9 messages
#1Ioannis Theoharis
theohari@ics.forth.gr

Please let me know if there is any option in PostgreSQL to achieve the
following usage of a b-tree index:

For a relation R(att0, att1) and a b-tree index on attribute att0,

on each insertion of a tuple into the table:
- look in the index to see whether the new tuple's att0 value already
exists, and
- if not, add the appropriate entry to the b-tree;
- if it does, do not add an entry.

In my application the relation is always clustered on att0. The only
information a query with a range condition over att0 in the WHERE clause
needs is the place on disk where the first tuple with a given att0 value
is stored.

The point is that because the table has so many rows, the index is too
big. But the number of distinct att0 values is orders of magnitude
smaller than the number of tuples.

I am trying to find out whether there is an alternative to the b-tree
index that would decrease the number of index blocks fetched into
memory.

Thanks.
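As a hedged sketch of the effect being asked for (the table and column
names below are illustrative, not from any existing schema), "one index
entry per distinct att0 value" can be approximated with a small side
table whose primary key holds the distinct values:

```sql
-- Illustrative sketch: keep one row (and therefore one index entry)
-- per distinct att0 value in a small side table.
CREATE TABLE r (att0 varchar(1000), att1 int4);
CREATE TABLE r_att0_summary (att0 varchar(1000) PRIMARY KEY);

-- On each insertion into r, record the att0 value only if it is not
-- already known; the NOT EXISTS form also works on old releases.
INSERT INTO r_att0_summary (att0)
SELECT 'some-value'
WHERE NOT EXISTS
    (SELECT 1 FROM r_att0_summary WHERE att0 = 'some-value');
```

(Under concurrent inserts this check-then-insert is racy; it is a sketch
of the idea, not production code.)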

#2Jeff Davis
jdavis-pgsql@empires.org
In reply to: Ioannis Theoharis (#1)
Re: About b-tree usage

If I understand your question, you want to reduce the index size by only
pointing to the first tuple in a table with a given key in att0, since
the rest of the tuples will be right afterward (because you keep the
table clustered on that key).

However, from the docs:
<http://www.postgresql.org/docs/8.0/static/sql-cluster.html>

"When a table is clustered, it is physically reordered based on the
index information. Clustering is a one-time operation: when the table is
subsequently updated, the changes are not clustered. That is, no attempt
is made to store new or updated rows according to their index order. If
one wishes, one can periodically recluster by issuing the command
again."

So, there is no guarantee that there won't be tuples with the same key
in a different physical location between your CLUSTER commands. That
means your index idea won't really work.

Perhaps you can alter your table layout/design to better suit your
needs.

For example:
* If you have a very small number of values in att0, you could make a
different table for each one and make a view that's the union of those
tables.
* You could make att1 an array. Yes, it is a horrible design from a
relational standpoint, but if you need performance, there may not be
many other options. You can then use set-returning functions and other
functions to make it behave more like a relation. Ideally, postgresql
would have relation-valued attributes, but currently arrays seem like
the best option available for simulating a relation-valued attribute.
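The first suggestion might look like this sketch (the values 'a' and 'b'
are placeholders for the actual att0 values):

```sql
-- One physical table per distinct att0 value...
CREATE TABLE r_a (att1 int4);  -- holds the rows where att0 = 'a'
CREATE TABLE r_b (att1 int4);  -- holds the rows where att0 = 'b'

-- ...unified behind a view that reconstructs the original relation.
CREATE VIEW r AS
    SELECT 'a' AS att0, att1 FROM r_a
    UNION ALL
    SELECT 'b' AS att0, att1 FROM r_b;
```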

If there are many identical values in att0, are you sure a sequential
scan isn't more efficient? Also, are you sure the index isn't working
well? It seems to me since you have the table clustered, it might be
fairly efficient as-is (it would get a huge benefit from the spatial
locality of the tuples in the table). Index size alone shouldn't destroy
your performance, since the idea of an index lookup is that it only has
to read O(log n) pages from the disk per lookup.

Regards,
Jeff Davis

On Sun, 2005-03-06 at 23:33 +0200, Ioannis Theoharis wrote:


---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

#3Ioannis Theoharis
theohari@ics.forth.gr
In reply to: Jeff Davis (#2)
Re: About b-tree usage

If there are many identical values in att0, are you sure a sequential
scan isn't more efficient? Also, are you sure the index isn't working
well? It seems to me since you have the table clustered, it might be
fairly efficient as-is (it would get a huge benefit from the spatial
locality of the tuples in the table). Index size alone shouldn't destroy
your performance, since the idea of an index lookup is that it only has
to read O(log n) pages from the disk per lookup.

For the next example, keep in mind that:
select relname, relpages, reltuples from pg_class;

relname | relpages | reltuples
--------------------------------+----------+-------------
...
tc2000000000 | 142858 | 1.00001e+06
inst_id_idx | 2745 | 1e+06
...

and that I run PostgreSQL on an UltraSPARC III 600 MHz machine with
512 MB RAM, OS: Solaris 9.

att0: varchar(1000)
att1: int4
and 0 <= att1 <= 900000000 for every tuple of the table and index.

query:
select att0 from tc2000000000 where att1<=900000000 AND att1>=0

plan:
Index Scan using inst_id_idx on tc2000000000 (cost=0.00..161603.06
rows=1000006 width=1004) (actual time=41.21..101917.36 rows=1000000 loops=1)
Index Cond: ((att1 <= 900000000) AND (att1 >= 0))
Total runtime: 103135.03 msec

query:
select att0 from tc2000000000

plan:
Seq Scan on tc2000000000 (cost=100000000.00..100152858.06 rows=1000006
width=1004) (actual time=0.21..42584.87 rows=1000000 loops=1)
Total runtime: 43770.73 msec

Can you explain this big difference to me? Perhaps PostgreSQL caches a
big part (or the whole) of the index in memory?

And by the way, why doesn't PostgreSQL select a sequential scan? (I have
done VACUUM ANALYZE.)

#4Jeff Davis
jdavis-pgsql@empires.org
In reply to: Ioannis Theoharis (#3)
Re: About b-tree usage

In that case, sequential scan is faster, but perhaps the planner doesn't
know that ahead of time. Try turning on more statistics if you haven't
already, and then run ANALYZE again. If the planner sees a range,
perhaps it assumes that it is a highly selective range, when in fact, it
consists of all of the tuples. Also, make sure enable_seqscan is true
(in case you turned it off for testing or something and forgot).
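Concretely, that advice might look like the following sketch (the
statistics target of 100 is just an illustrative value):

```sql
-- Collect more detailed statistics for the column used in the range
-- condition, refresh them, and confirm seq scans are not disabled.
ALTER TABLE tc2000000000 ALTER COLUMN att1 SET STATISTICS 100;
ANALYZE tc2000000000;
SHOW enable_seqscan;
```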

A seqscan is usually faster when a large proportion of the tuples are
returned because:
(1) It uses sequential I/O; whereas an index might access tuples in a
random order.
(2) It doesn't have to read the index's disk pages at all.

I suspect you don't need to return all the tuples in the table. If you
include the details of a real scenario perhaps the people on the list
could be more helpful.

Regards,
Jeff Davis

On Mon, 2005-03-07 at 20:43 +0200, Ioannis Theoharis wrote:



#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ioannis Theoharis (#3)
Re: About b-tree usage

Ioannis Theoharis <theohari@ics.forth.gr> writes:

select att0 from tc2000000000 where att1<=900000000 AND att1>=0

plan:
Index Scan using inst_id_idx on tc2000000000 (cost=0.00..161603.06
rows=1000006 width=1004) (actual time=41.21..101917.36 rows=1000000 loops=1)
Index Cond: ((att1 <= 900000000) AND (att1 >= 0))
Total runtime: 103135.03 msec

query:
select att0 from tc2000000000

plan:
Seq Scan on tc2000000000 (cost=100000000.00..100152858.06 rows=1000006
width=1004) (actual time=0.21..42584.87 rows=1000000 loops=1)
Total runtime: 43770.73 msec

Can you explain this big difference to me? Perhaps PostgreSQL caches a
big part (or the whole) of the index in memory?

seqscan is naturally faster when you have to visit the whole table anyway.

And by the way, why doesn't PostgreSQL select a sequential scan?

Because you've set enable_seqscan = off.

regards, tom lane

#6Ioannis Theoharis
theohari@ics.forth.gr
In reply to: Jeff Davis (#4)
Re: About b-tree usage

Let me explain: I have turned enable_seqscan off in order to discourage
the optimizer from choosing a seq scan whenever an index scan can be
used.

But in this case, why doesn't the optimizer choose a seq scan
(discouraging is different from preventing)?

In many cases I need only a small fraction of rows to be retrieved. But
this extreme case is a real scenario (not the most frequent, but real).

I am trying to find a way to achieve good performance even for the
extreme case. Is there any way?

P.S. In the bibliography there is a different alternative for indices:
besides the simple approach of <attr_val, rid> entries there is the
alternative <attr_val, set of rids>, which attaches to each distinct
attr_val the set of rids of all rows with the same attr_val. Is this
alternative taken into account in Postgres?

On Mon, 7 Mar 2005, Jeff Davis wrote:


#7Michael Paesold
mpaesold@gmx.at
In reply to: Ioannis Theoharis (#1)
Re: About b-tree usage

Ioannis Theoharis wrote:

Let me explain: I have turned enable_seqscan off in order to discourage
the optimizer from choosing a seq scan whenever an index scan can be
used.

But in this case, why doesn't the optimizer choose a seq scan
(discouraging is different from preventing)?

You probably know that PostgreSQL uses a cost-based optimizer. The optimizer
chooses different plans based on the cost it calculates for them.

enable_seqscan = OFF is very primitive but effective: it tells the optimizer
to raise the cost of a sequential scan to a value going towards infinity.

When it comes to the choice between seq scan and index scan, the
optimizer will now always choose the index scan. It does not "know"
anymore whether a sequential scan would be cheaper -- *you* have told
the optimizer that it is not.

Only when there is no way at all to execute your query except a seq scan
must the optimizer choose this very costly path. An example is an
unqualified SELECT * FROM table; -- there is no index path there.

I hope that answers your first question. As you see, enable_seqscan = OFF
should not be used for production systems, but only for debugging. Perhaps
it's useful to set at query level, but not in postgresql.conf.
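A per-session, per-query use while debugging might look like this
sketch:

```sql
-- Disable seq scans only for the plan you want to inspect,
-- then restore the server default.
SET enable_seqscan = off;
EXPLAIN ANALYZE
    SELECT att0 FROM tc2000000000
    WHERE att1 >= 0 AND att1 <= 900000000;
RESET enable_seqscan;
```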

Best Regards,
Michael Paesold

#8Klaus Naumann
lists@distinctmind.de
In reply to: Ioannis Theoharis (#6)
Re: About b-tree usage

Hi,

if you're using a pg version prior to 8.0, your pitfall might also be a
conversion between int and bigint datatypes. So if you're doing
something like

SELECT a.x, b.y FROM a, b WHERE a.x = b.x;

and a.x is INT4 and b.x is INT8 (or BIGINT), the planner counts this as
a data conversion and uses a full table scan.
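On those older releases, the usual workaround was an explicit cast (or a
quoted literal) so both sides of the comparison have the same type; a
sketch:

```sql
-- Cast the int4 side to int8 so the comparison is same-typed and an
-- index on b.x can be considered (pre-8.0 cross-type behavior).
SELECT a.x, b.y FROM a, b WHERE a.x::int8 = b.x;

-- Same idea for a constant: cast it (or quote it) explicitly.
SELECT b.y FROM b WHERE b.x = 42::int8;
```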

Greetings, Klaus


#9Jeff Davis
jdavis-pgsql@empires.org
In reply to: Ioannis Theoharis (#6)
Re: About b-tree usage

On Tue, 2005-03-08 at 15:30 +0200, Ioannis Theoharis wrote:

Let me explain: I have turned enable_seqscan off in order to discourage
the optimizer from choosing a seq scan whenever an index scan can be
used.

But in this case, why doesn't the optimizer choose a seq scan
(discouraging is different from preventing)?

As Michael Paesold pointed out, enable_seqscan=off effectively means
that it will never choose a seqscan when there exists an alternative.

In many cases I need only a small fraction of rows to be retrieved. But
this extreme case is a real scenario (not the most frequent, but real).

Returning all the rows in a large table is a real scenario? I would
expect you to at least use a cursor. Keep in mind that libpq has to
store all those rows in local client memory, so a cursor is a good idea
if that is the case.
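A cursor-based fetch loop might look like this sketch (the cursor name
and the fetch size of 1000 are arbitrary):

```sql
-- Stream the result in batches instead of materializing a million
-- rows in client memory at once.
BEGIN;
DECLARE big_read CURSOR FOR SELECT att0 FROM tc2000000000;
FETCH 1000 FROM big_read;  -- repeat until no more rows are returned
CLOSE big_read;
COMMIT;
```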

And regardless, if you are returning all the rows in a table, the
absolute fastest way possible is a seq scan. An index is much slower
because it does way more work, and the disk accesses are random rather
than sequential.

I am trying to find a way to achieve good performance even for the
extreme case. Is there any way?

The extreme case you provided is a SELECT without a WHERE. We already
know that PostgreSQL is executing that as efficiently as possible when
enable_seqscan=on. Why not send me a simple script that generates some
data similar to what you want to access, and then the query you'd like
to perform on that data. We might find that PostgreSQL is already
executing the query as efficiently as possible; in fact I suspect that's
the case (although perhaps some tuning would help).

In short, we shouldn't try to use indexes for all situations; they are
inherently worse for some situations. That's why PostgreSQL has both
seqscans and indexes, and an intelligent planner.

P.S. In the bibliography there is a different alternative for indices:
besides the simple approach of <attr_val, rid> entries there is the
alternative <attr_val, set of rids>, which attaches to each distinct
attr_val the set of rids of all rows with the same attr_val. Is this
alternative taken into account in Postgres?

I don't entirely understand what you're asking. It seems like you're
talking about a relatively minor implementation issue, and if it does
make a difference, it would probably not be very much. Perhaps rephrase
your question?
