Yet Another (Simple) Case of Index not used

Started by Denis @ Next2Meabout 23 years ago35 messagesgeneral
Jump to latest
#1Denis @ Next2Me
denis@next2me.com

Hi there,
I'm running into a quite puzzling simple example where the index I've
created on a fairly big table (465K entries) is not used, against all common
sense expectations:
The query I am trying to do (fast) is:

select count(*) from addresses;

This takes more than a second to complete, because, as the 'explain' command
shows me,
the index created on 'addresses' is not used, and a seq scan is being used.
One would assume that the creation of an index would allow the counting of
the number of entries in a table to be instantanous?

Here are the details:

* Using the latest postgresql 7.3.2 release, built and installed from
sources on a Linux box, under Red Hat 8.0

* I have an 'addresses' table defined as:
Columm | Type
-------------------------------
address | text
city | char var (20)
zip | char var (5)
state | char var (2)
Unique keys: addresses_idx

* I have created a unique index 'addresses_idx' on (address, city, zip,
state):
\d addresses_idx;
Index "addresses_idx"
Columm | Type
-------------------------------
address | text
city | char var (20)
zip | char var (5)
state | char var (2)
unique btree

* I did (re)create the index several times
* I did run the vacuum analyse command several times
* I forced enable_indexscan to true
* I forced enable_seqscan to false

Despite of all of this, each time I try:
===> explain select count(*) from addresses;
I get the following:
===> NOTICE: QUERY PLAN:
===>
===> Aggregate (cost=100012799.89..100012799.89 rows=1 width=0)
===> -> Seq Scan on addresses (cost=100000000.00..100011635.11 rows=465911
width=0)

Quite puzzling, isn't it?
I've searched a bunch of mailing lists and websites, and found many reports
of special cases where it could be argued that the planner may have had a
case for choosing seq scanning over idx scanning, but unless I am missing
some fundamental concept, there's something wrong here.
Any suggestion anyone?
Thanks,

Denis
denis@next2me.com

#2Dennis Gearon
gearond@cvc.net
In reply to: Denis @ Next2Me (#1)
Re: Yet Another (Simple) Case of Index not used

as I remember, mysql keeps the record count in a variable and is instantaneaous
with that kind of query. Recent posts suggest the Postgres does not keep that
variable and has to do the seq scan.

Denis wrote:

Show quoted text

Hi there,
I'm running into a quite puzzling simple example where the index I've
created on a fairly big table (465K entries) is not used, against all common
sense expectations:
The query I am trying to do (fast) is:

select count(*) from addresses;

This takes more than a second to complete, because, as the 'explain' command
shows me,
the index created on 'addresses' is not used, and a seq scan is being used.
One would assume that the creation of an index would allow the counting of
the number of entries in a table to be instantanous?

Here are the details:

* Using the latest postgresql 7.3.2 release, built and installed from
sources on a Linux box, under Red Hat 8.0

* I have an 'addresses' table defined as:
Columm | Type
-------------------------------
address | text
city | char var (20)
zip | char var (5)
state | char var (2)
Unique keys: addresses_idx

* I have created a unique index 'addresses_idx' on (address, city, zip,
state):
\d addresses_idx;
Index "addresses_idx"
Columm | Type
-------------------------------
address | text
city | char var (20)
zip | char var (5)
state | char var (2)
unique btree

* I did (re)create the index several times
* I did run the vacuum analyse command several times
* I forced enable_indexscan to true
* I forced enable_seqscan to false

Despite of all of this, each time I try:
===> explain select count(*) from addresses;
I get the following:
===> NOTICE: QUERY PLAN:
===>
===> Aggregate (cost=100012799.89..100012799.89 rows=1 width=0)
===> -> Seq Scan on addresses (cost=100000000.00..100011635.11 rows=465911
width=0)

Quite puzzling, isn't it?
I've searched a bunch of mailing lists and websites, and found many reports
of special cases where it could be argued that the planner may have had a
case for choosing seq scanning over idx scanning, but unless I am missing
some fundamental concept, there's something wrong here.
Any suggestion anyone?
Thanks,

Denis
denis@next2me.com

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

#3Dann Corbit
DCorbit@connx.com
In reply to: Dennis Gearon (#2)
Re: Yet Another (Simple) Case of Index not used

-----Original Message-----
From: Denis [mailto:denis@next2me.com]
Sent: Tuesday, April 08, 2003 12:57 PM
To: pgsql-performance@postgresql.org;
pgsql-general@postgresql.org; pgsql-sql@postgresql.org
Subject: [GENERAL] Yet Another (Simple) Case of Index not used

Hi there,
I'm running into a quite puzzling simple example where the
index I've created on a fairly big table (465K entries) is
not used, against all common sense expectations: The query I
am trying to do (fast) is:

select count(*) from addresses;

This takes more than a second to complete, because, as the
'explain' command shows me, the index created on 'addresses'
is not used, and a seq scan is being used.

As well it should be.

One would assume
that the creation of an index would allow the counting of the
number of entries in a table to be instantanous?

Traversing the index to perform the count will definitely make the query
many times slower.

A general rule of thumb (not sure if it is true with PostgreSQL) is that
if you have to traverse more than 10% of the data with an index then a
full table scan will be faster. This is especially true when there is
highly redundant data in the index fields. If there were an index on
bit data type, and you have half and half 1 and 0, an index scan of the
table will be disastrous.

To simply scan the table, we will just sequentially read pages until the
data is exhausted. If we follow the index, we will randomly jump from
page to page, defeating the read buffering.
[snip]

#4Dennis Gearon
gearond@cvc.net
In reply to: Dann Corbit (#3)
Re: Yet Another (Simple) Case of Index not used

from mysql manual:
-------------------------------------------------------------
"COUNT(*) is optimized to return very quickly if the SELECT retrieves from one
table, no other columns are retrieved, and there is no WHERE clause. For example:

mysql> select COUNT(*) from student;"
-------------------------------------------------------------

A nice little optimization, maybe not possible in a MVCC system.

Dann Corbit wrote:

Show quoted text

-----Original Message-----
From: Denis [mailto:denis@next2me.com]
Sent: Tuesday, April 08, 2003 12:57 PM
To: pgsql-performance@postgresql.org;
pgsql-general@postgresql.org; pgsql-sql@postgresql.org
Subject: [GENERAL] Yet Another (Simple) Case of Index not used

Hi there,
I'm running into a quite puzzling simple example where the
index I've created on a fairly big table (465K entries) is
not used, against all common sense expectations: The query I
am trying to do (fast) is:

select count(*) from addresses;

This takes more than a second to complete, because, as the
'explain' command shows me, the index created on 'addresses'
is not used, and a seq scan is being used.

As well it should be.

One would assume
that the creation of an index would allow the counting of the
number of entries in a table to be instantanous?

Traversing the index to perform the count will definitely make the query
many times slower.

A general rule of thumb (not sure if it is true with PostgreSQL) is that
if you have to traverse more than 10% of the data with an index then a
full table scan will be faster. This is especially true when there is
highly redundant data in the index fields. If there were an index on
bit data type, and you have half and half 1 and 0, an index scan of the
table will be disastrous.

To simply scan the table, we will just sequentially read pages until the
data is exhausted. If we follow the index, we will randomly jump from
page to page, defeating the read buffering.
[snip]

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

#5Josh Berkus
josh@agliodbs.com
In reply to: Denis @ Next2Me (#1)
Re: [SQL] Yet Another (Simple) Case of Index not used

Dennis,

I'm running into a quite puzzling simple example where the index I've
created on a fairly big table (465K entries) is not used, against all common
sense expectations:
The query I am trying to do (fast) is:

select count(*) from addresses;

PostgreSQL is currently unable to use indexes on aggregate queries. This is
because of two factors:
1) MVCC means that the number of rows must be recalculated for each
connection's current transaction, and cannot be "cached" anywhere by the
database system;
2) Our extensible model of user-defined aggregates means that each aggregate
is a "black box" whose internal operations are invisible to the planner.

This is a known performance issue for Postgres, and I believe that a couple of
people on Hackers are looking at modifying aggregate implementation for 8.0
to use appropriate available indexes, at least for MIN, MAX and COUNT. Until
then, you will need to either put up with the delay, or create a
trigger-driven aggregates caching table.

If you are trying to do a correlated count, like "SELECT type, count(*) from
aggregates GROUP BY type", Tom Lane has already added a hash-aggregates
structure in the 7.4 source that will speed this type of query up
considerably for systems with lots of RAM.

(PS: in the future, please stick to posting questions to one list at a time,
thanks)

--
-Josh Berkus
Aglio Database Solutions
San Francisco

#6Martijn van Oosterhout
kleptog@svana.org
In reply to: Denis @ Next2Me (#1)
Re: Yet Another (Simple) Case of Index not used

On Tue, Apr 08, 2003 at 12:57:16PM -0700, Denis wrote:

The query I am trying to do (fast) is:

select count(*) from addresses;

This takes more than a second to complete, because, as the 'explain' command
shows me,
the index created on 'addresses' is not used, and a seq scan is being used.
One would assume that the creation of an index would allow the counting of
the number of entries in a table to be instantanous?

Incorrect assumption. select count(*) can produce different results in
different backends depending on the current state of the active
transactions.
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

"the West won the world not by the superiority of its ideas or values or
religion but rather by its superiority in applying organized violence.
Westerners often forget this fact, non-Westerners never do."
- Samuel P. Huntington

#7Denis @ Next2Me
denis@next2me.com
In reply to: Martijn van Oosterhout (#6)
Re: [GENERAL] Yet Another (Simple) Case of Index not used

Interesting generic response. In other words, "it all depends".
Well, a de facto observation is: "In my case, it's always much slower with, say, mysql".
Understand me, I don't mean to be starting a performance comparaison mysql vs postgresql,
which is probably an old subject, I am just looking for a solution to solve this type
of performance issues, ie the generic cases:
select count(*) from addresses where address is like 'pattern%';
Which are very fast on mysql, and very slow on postgresql.
Understood, it will always depend on some parameters, but the real question is: how
much control does one have over those parameters, and how does one tweak them to reach
optimal performance?

D.

Show quoted text

-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org]On Behalf Of Martijn van
Oosterhout
Sent: Tuesday, April 08, 2003 4:46 PM
To: Denis
Cc: pgsql-performance@postgresql.org; pgsql-general@postgresql.org;
pgsql-sql@postgresql.org
Subject: Re: [PERFORM] [GENERAL] Yet Another (Simple) Case of Index not
used

On Tue, Apr 08, 2003 at 12:57:16PM -0700, Denis wrote:

The query I am trying to do (fast) is:

select count(*) from addresses;

This takes more than a second to complete, because, as the 'explain' command
shows me,
the index created on 'addresses' is not used, and a seq scan is being used.
One would assume that the creation of an index would allow the counting of
the number of entries in a table to be instantanous?

Incorrect assumption. select count(*) can produce different results in
different backends depending on the current state of the active
transactions.
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

"the West won the world not by the superiority of its ideas or values or
religion but rather by its superiority in applying organized violence.
Westerners often forget this fact, non-Westerners never do."
- Samuel P. Huntington

#8Denis @ Next2Me
denis@next2me.com
In reply to: Josh Berkus (#5)
Re: [SQL] Yet Another (Simple) Case of Index not used

Josh,

I am no database expert, and even less knowledgeable about the internals
of postgresql, so I'll trust you on the 2 points you make below.

Are you saying the 7.4 'group by' trick would be faster than the simple select count(*)?
That seems hard to believe, being that the request now has to fetch / sort the data.
I must be missing something.

The kind of requests that I am really interested in are:
select count(*) from table where table.column like 'pattern%'
These seems to go much master on mysql (which I guess it not a MVCC database? or wasn't
the Innobase supposed to make it so?), than on postgresql.

So, in the meantime, I've decided to split up my data into two sets,
the static big tables which are handled by mysql, and the rest of it handled
by postgresql....

ps: apologies for the cross-posting.

Show quoted text

-----Original Message-----
From: Josh Berkus [mailto:josh@agliodbs.com]
Sent: Tuesday, April 08, 2003 2:53 PM
To: Denis; pgsql-performance@postgresql.org
Subject: Re: [SQL] Yet Another (Simple) Case of Index not used

Dennis,

I'm running into a quite puzzling simple example where the index I've
created on a fairly big table (465K entries) is not used, against all common
sense expectations:
The query I am trying to do (fast) is:

select count(*) from addresses;

PostgreSQL is currently unable to use indexes on aggregate queries. This is
because of two factors:
1) MVCC means that the number of rows must be recalculated for each
connection's current transaction, and cannot be "cached" anywhere by the
database system;
2) Our extensible model of user-defined aggregates means that each aggregate
is a "black box" whose internal operations are invisible to the planner.

This is a known performance issue for Postgres, and I believe that a couple of
people on Hackers are looking at modifying aggregate implementation for 8.0
to use appropriate available indexes, at least for MIN, MAX and COUNT. Until
then, you will need to either put up with the delay, or create a
trigger-driven aggregates caching table.

If you are trying to do a correlated count, like "SELECT type, count(*) from
aggregates GROUP BY type", Tom Lane has already added a hash-aggregates
structure in the 7.4 source that will speed this type of query up
considerably for systems with lots of RAM.

(PS: in the future, please stick to posting questions to one list at a time,
thanks)

--
-Josh Berkus
Aglio Database Solutions
San Francisco

#9Brent Wood
b.wood@niwa.co.nz
In reply to: Martijn van Oosterhout (#6)
Re: Yet Another (Simple) Case of Index not used

On Wed, 9 Apr 2003, Martijn van Oosterhout wrote:

On Tue, Apr 08, 2003 at 12:57:16PM -0700, Denis wrote:

The query I am trying to do (fast) is:

select count(*) from addresses;

This takes more than a second to complete, because, as the 'explain' command
shows me,
the index created on 'addresses' is not used, and a seq scan is being used.
One would assume that the creation of an index would allow the counting of
the number of entries in a table to be instantanous?

Incorrect assumption. select count(*) can produce different results in
different backends depending on the current state of the active
transactions.

Some thoughts:

Select count(*) is often applied to views, and may take some time
depending on the underlying query.

However, for a single table, I would have thought that if there are no
write locks or open transactions for the table, the index would return a
faster result than a scan? Is there room for some optimisation here?

Does count(<primary_key>) work faster, poss using the unique index on the
key (for non-composite keys)?

Cheers
Brent Wood

#10Stephan Szabo
sszabo@megazone23.bigpanda.com
In reply to: Denis @ Next2Me (#8)
Re: [SQL] Yet Another (Simple) Case of Index not used

On Tue, 8 Apr 2003, Denis @ Next2Me wrote:

The kind of requests that I am really interested in are:
select count(*) from table where table.column like 'pattern%'

If you think an index scan should be faster, you can try
set enable_seqscan=off;
and see if that changes the plan generated by explain and with analyze
you can compare the time used. Without information on the estimated
selectivity it's hard to say what's right.

If it doesn't use the index (ie, it's still using a sequential scan)
after the enable_seqscan=off it's likely that you didn't initdb in "C"
locale in which case like won't use indexes currently (you can see the
archives for long description, but the short one is that some of the
locale rules can cause problems with using the index).

#11Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Denis @ Next2Me (#8)
Re: [SQL] Yet Another (Simple) Case of Index not used

Hi Denis,

The kind of requests that I am really interested in are:
select count(*) from table where table.column like 'pattern%'
These seems to go much master on mysql (which I guess it not a MVCC database? or wasn't
the Innobase supposed to make it so?), than on postgresql.

A few things.

* MVCC in PostgreSQL allows us to be way faster than MySQL when you have heaps of concurrent readers and writers. The tradeoff is that count(*) is slow since PostgreSQL needs to check that each tuple is actually visible to your query (eg. you start a transaction, somone else inserts a row, you do a count(*) - should the result include that new row or not? Answer: no.)

* Just avoid doing count(*) over the entire table with no where clause!!! It's as easy as that

* The LIKE 'pattern%' is indexable in Postgresql. You will need to create a normal btree index over table.column. So long as the index is returning a small portion of the table (eg. say only 5-10% of the fields begin with pattern), then the index will be used and it will be fast.

* If you want really fast full text indexing, check out contrib/tsearch - it's really, really, really fast.

Chris

#12Martijn van Oosterhout
kleptog@svana.org
In reply to: Denis @ Next2Me (#7)
Re: [GENERAL] Yet Another (Simple) Case of Index not used

On Tue, Apr 08, 2003 at 05:10:01PM -0700, Denis @ Next2Me wrote:

Interesting generic response. In other words, "it all depends".
Well, a de facto observation is: "In my case, it's always much slower with, say, mysql".

Curious, is mysql still so fast when you have transactions enabled? How does
it deal with the following:

begin;
delete from bigtable;
select count(*) from bigtable; -- Should return 0
abort;
select count(*) from bigtable; -- Should give original size

Understand me, I don't mean to be starting a performance comparaison mysql
vs postgresql, which is probably an old subject, I am just looking for a
solution to solve this type of performance issues, ie the generic cases:
select count(*) from addresses where address is like 'pattern%';
Which are very fast on mysql, and very slow on postgresql.

Ah, but that may be caused by something else altogether. LIKE is only
indexable in the C locale so if you have en_US as your locale, your LIKE
won't be indexable. See the discussion threads on this mailing list in the past.

Understood, it will always depend on some parameters, but the real
question is: how much control does one have over those parameters, and how
does one tweak them to reach optimal performance?

Hmm, it depends. One person put it that mysql goes for performance first,
then correctness, whereas postgresql goes for correctness first, then
performance.

Maybe fti (full text indexing) would work better?

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

"the West won the world not by the superiority of its ideas or values or
religion but rather by its superiority in applying organized violence.
Westerners often forget this fact, non-Westerners never do."
- Samuel P. Huntington

#13Denis @ Next2Me
denis@next2me.com
In reply to: Martijn van Oosterhout (#12)
Re: [GENERAL] Yet Another (Simple) Case of Index not used

Stephan, Martijn,
good call, that was it: the C locale.

I had used all the default settings when installing/creating the database,
and apparently it used my default locale (en_US).
I recreated (initdb) the database with --no-locale, and recreated the database,
and sure enough, the query:
select count(*) from table where table.column like 'fol%'
was a zillion (well almost) time faster than it used to be,
and on pair with mysql's performance.
And as expected, the EXPLAIN on that query does show indeed
the use of the index I had created on the table.

Sweet, I can now nuke mysql out of my system.

Folks, thank you all for the help and other suggestions.

Denis Amselem
Next2Me Inc.

Stephan said:

If it doesn't use the index (ie, it's still using a sequential scan)
after the enable_seqscan=off it's likely that you didn't initdb in "C"
locale in which case like won't use indexes currently (you can see the
archives for long description, but the short one is that some of the
locale rules can cause problems with using the index).

Martijn said:

Show quoted text

Ah, but that may be caused by something else altogether. LIKE is only
indexable in the C locale so if you have en_US as your locale, your LIKE
won't be indexable. See the discussion threads on this mailing list in the past.

#14Josh Berkus
josh@agliodbs.com
In reply to: Denis @ Next2Me (#8)
Re: [SQL] Yet Another (Simple) Case of Index not used

Denis,

Are you saying the 7.4 'group by' trick would be faster than the simple
select count(*)? That seems hard to believe, being that the request now has
to fetch / sort the data. I must be missing something.

No, I'm saying that the 7.4 hash-aggregate is faster than the same query was
under 7.2 or 7.3. Much faster. But it does little to speed up a raw
count(*).

The kind of requests that I am really interested in are:
select count(*) from table where table.column like 'pattern%'

Hash-aggregates may, in fact, help with this. Care to try downloading the
the source from CVS?

These seems to go much master on mysql (which I guess it not a MVCC
database? or wasn't the Innobase supposed to make it so?),

They did incorporate a lot of MVCC logic into InnoDB tables, yes. Which means
that if SELECT count(*) on an InnoDB table is just as fast as a MyISAM table,
then it is not accurate. This would be in keeping with MySQL's design
philosophy, which values performance and simplicity over accuracy and
precision -- the opposite of our philosophy.

So, in the meantime, I've decided to split up my data into two sets,
the static big tables which are handled by mysql, and the rest of it
handled by postgresql....

Hey, if it works for you, it's probably easier than dealing with the
PostgreSQL workarounds to this performance issue. I'll ask you to give
PostgreSQL a try for those tables again when 7.4 comes out.

ps: apologies for the cross-posting.

De nada. The Performance list is the right place for this sort of question
in the future.

--
Josh Berkus
Aglio Database Solutions
San Francisco

#15Bruce Momjian
bruce@momjian.us
In reply to: Dennis Gearon (#4)
Re: [PERFORM] Yet Another (Simple) Case of Index not used

Dennis Gearon wrote:

from mysql manual:
-------------------------------------------------------------
"COUNT(*) is optimized to return very quickly if the SELECT retrieves from one
table, no other columns are retrieved, and there is no WHERE clause. For example:

mysql> select COUNT(*) from student;"
-------------------------------------------------------------

A nice little optimization, maybe not possible in a MVCC system.

I think the only thing you can do with MVCC is to cache the value and
tranaction id for "SELECT AGG(*) FROM tab" and make the cached value
visible to transaction id's greater than the one that executed the
query, and invalidate the cache every time the table is modified.

In fact, don't clear the cache, just record the transaction id of the
table modification command so we can use standard visibility routines to
make the cache usable as long as possiible.

The cleanest way would probably be to create an aggregate cache system
table, and to insert into it when someone does an unqualified aggregate,
and to delete from it when someone modifies the table --- the MVCC tuple
visibility rules are handled automatically. Queries can look in there
to see if a visible cached value already exists. Of course, the big
question is whether this would be a big win, and whether the cost of
upkeep would justify it.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#16Richard Huxton
dev@archonet.com
In reply to: Bruce Momjian (#15)
Re: [PERFORM] Yet Another (Simple) Case of Index not used

On Tuesday 15 Apr 2003 3:23 pm, Bruce Momjian wrote:

Dennis Gearon wrote:

from mysql manual:
-------------------------------------------------------------
"COUNT(*) is optimized to return very quickly if the SELECT retrieves
from one table, no other columns are retrieved, and there is no WHERE
clause. For example:

mysql> select COUNT(*) from student;"
-------------------------------------------------------------

The cleanest way would probably be to create an aggregate cache system
table, and to insert into it when someone does an unqualified aggregate,
and to delete from it when someone modifies the table --- the MVCC tuple
visibility rules are handled automatically. Queries can look in there
to see if a visible cached value already exists. Of course, the big
question is whether this would be a big win, and whether the cost of
upkeep would justify it.

If the rule system could handle something like:

CREATE RULE quick_foo_count AS ON SELECT count(*) FROM foo
DO INSTEAD
SELECT quick_count FROM agg_cache WHERE tbl_name='foo';

The whole thing could be handled by user-space triggers/rules and still
invisible to the end-user.

--
Richard Huxton

#17Kevin Brown
kevin@sysexperts.com
In reply to: Josh Berkus (#14)
Re: [SQL] Yet Another (Simple) Case of Index not used

Josh Berkus wrote:

Denis,

Are you saying the 7.4 'group by' trick would be faster than the simple
select count(*)? That seems hard to believe, being that the request now has
to fetch / sort the data. I must be missing something.

No, I'm saying that the 7.4 hash-aggregate is faster than the same query was
under 7.2 or 7.3. Much faster. But it does little to speed up a raw
count(*).

The kind of requests that I am really interested in are:
select count(*) from table where table.column like 'pattern%'

These seems to go much master on mysql (which I guess it not a MVCC
database? or wasn't the Innobase supposed to make it so?),

They did incorporate a lot of MVCC logic into InnoDB tables, yes.
Which means that if SELECT count(*) on an InnoDB table is just as
fast as a MyISAM table, then it is not accurate.

This is not necessarily true. The trigger-based approach to tracking
the current number of rows in a table might well be implemented
internally, and that may actually be much faster than doing it using
triggers (the performance losses you saw may well have been the result
of PG's somewhat poor trigger performance, and not the result of the
approach itself. It would be interesting to know how triggers effect
the performance of other databases).

--
Kevin Brown kevin@sysexperts.com

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Brown (#17)
Re: [SQL] Yet Another (Simple) Case of Index not used

Kevin Brown <kevin@sysexperts.com> writes:

Josh Berkus wrote:

They did incorporate a lot of MVCC logic into InnoDB tables, yes.
Which means that if SELECT count(*) on an InnoDB table is just as
fast as a MyISAM table, then it is not accurate.

This is not necessarily true. The trigger-based approach to tracking
the current number of rows in a table might well be implemented
internally, and that may actually be much faster than doing it using
triggers

You missed the point of Josh's comment: in an MVCC system, the correct
COUNT() varies depending on which transaction is asking. Therefore it
is not possible for a centrally maintained row counter to give accurate
results to everybody, no matter how cheap it is to maintain.

(The cheapness can be disputed as well, since it creates a single point
of contention for all inserts and deletes on the table. But that's a
different topic.)

regards, tom lane

#19Josh Berkus
josh@agliodbs.com
In reply to: Tom Lane (#18)
Re: [SQL] Yet Another (Simple) Case of Index not used

Kevin, Tom:

(The cheapness can be disputed as well, since it creates a single point
of contention for all inserts and deletes on the table. But that's a
different topic.)

Actually, this was the problem with the trigger method of maintaining COUNT
information in PostgreSQL. The statistics table itself becomes a
significant souce of delay, since if a table_A gets 10,000 rows updated than
table_count_A must necessarily be updated 10,000 times ... creating a lot of
dead tuples and severely attenuating the table on disk until the next vacuum
... resulting in Update #10,000 to table_count_A taking 100+ times as long as
Update #1 does, due to the required random seek time on disk.

I can personally think of two ways around this:

In MySQL: store table_count_A as a non-MVCC table or global variable.
Drawback: the count would not be accurate, as you would see changes due to
incomplete transactions and eventually the count would be knocked off
completely by an overload of multi-user activity. However, this does fit
with MySQL's design philosophy of "Speed over accuracy", so I suspect that
that's what they're doing.

In PostgreSQL:
a) Put table_count_A on superfast media like a RAM card so that random seeks
after 10,000 updates do not become a significant delay;
b) create an asynchronious table aggregates collector which would collect
programmed statistics (like count(*) from table A) much in the same way that
the planner statistics collector does. This would have the disadvantage of
on being up to date when the database is idle, but the advantage of not
imposing any significant overhead on Updates.
(Incidentally, I proposed this to one of my clients who complained about
Postgres' slow aggregate performance, but they declined to fund the effort)

--
Josh Berkus
Aglio Database Solutions
San Francisco

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#19)
Re: [SQL] Yet Another (Simple) Case of Index not used

Josh Berkus <josh@agliodbs.com> writes:

In PostgreSQL:
a) Put table_count_A on superfast media like a RAM card so that random seeks
after 10,000 updates do not become a significant delay;

As long as we're talking ugly, here ;-)

You could use a sequence to hold the aggregate counter. A sequence
isn't transactional and so does not accumulate dead tuples. "setval()"
and "select last_value" should have constant-time performance.

regards, tom lane

#21Kevin Brown
kevin@sysexperts.com
In reply to: Tom Lane (#18)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Brown (#21)
#23Kevin Brown
kevin@sysexperts.com
In reply to: Tom Lane (#22)
#24Andreas Pflug
pgadmin@pse-consulting.de
In reply to: Kevin Brown (#23)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Brown (#23)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andreas Pflug (#24)
#27Andreas Pflug
pgadmin@pse-consulting.de
In reply to: Tom Lane (#26)
#28Andrew Sullivan
andrew@libertyrms.info
In reply to: Tom Lane (#25)
#29Kevin Brown
kevin@sysexperts.com
In reply to: Tom Lane (#25)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Brown (#29)
#31Josh Berkus
josh@agliodbs.com
In reply to: Kevin Brown (#29)
#32Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Josh Berkus (#19)
#33Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Denis @ Next2Me (#8)
#34Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#22)
#35Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#15)