PostgreSQL 8.0.0 Release Candidate 5

Started by Marc G. Fournier, about 21 years ago, 114 messages
#1Marc G. Fournier
scrappy@postgresql.org

Due to several small bugs, and one fairly large one, found in Release
Candidate 4, we have been forced to release our 5th (and hopefully last)
Release Candidate so that we can get some proper testing in on the
changes before release.

A current list of *known* supported platforms can be found at:

http://developer.postgresql.org/supported-platforms.html

We're always looking to improve that list, so we encourage anyone running
a platform not listed to report any successes or failures with Release
Candidate 4.

Barring *any* coding changes (documentation != code) over the next week or
so, we *hope* that this will be the final Release Candidate before Full
Release, which is aimed for the 19th of January.

As always, this release is available on all mirrors, as listed at:

http://wwwmaster.postgresql.org/download/mirrors-ftp

For those using Bittorrent, David Fetter has updated the .torrents,
available at:

http://bt.postgresql.org

Please report any bugs found in this Release Candidate to:

pgsql-bugs@postgresql.org

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664

#2Simon Riggs
simon@2ndquadrant.com
In reply to: Marc G. Fournier (#1)
SUSE port (was [ANNOUNCE] PostgreSQL 8.0.0 Release Candidate 5)

Not sure what is going on here: why is SUSE not listed on the supported
platforms list? (still)

...is it because Reinhard seems resistant (after private conversation)
to the idea of submitting a formal port report via HACKERS, like
everybody else?

...or is it because his postings to ANNOUNCE about the SUSE port have
gone unnoticed by those who compile the supported platforms list?

...or both?

There seems to be no reason for either to occur, but still - no port listed.

Please list the SUSE port, as reported by Reinhard Max.
Please can Reinhard (or another SUSE rep) submit port reports to
pgsql-hackers@postgresql.org.

Best Regards, Simon Riggs


On Tue, 2005-01-11 at 16:15 -0400, Marc G. Fournier wrote:

Due to several small bugs, and one fairly large one, found in Release
Candidate 4, we have been forced to release our 5th (and hopefully last)
Release Candidate so that we can get some proper testing in on the
changes before release.

A current list of *known* supported platforms can be found at:

http://developer.postgresql.org/supported-platforms.html

We're always looking to improve that list, so we encourage anyone running
a platform not listed to report any successes or failures with Release
Candidate 4.

Barring *any* coding changes (documentation != code) over the next week or
so, we *hope* that this will be the final Release Candidate before Full
Release, which is aimed for the 19th of January.

As always, this release is available on all mirrors, as listed at:

http://wwwmaster.postgresql.org/download/mirrors-ftp

For those using Bittorrent, David Fetter has updated the .torrents,
available at:

http://bt.postgresql.org

Please report any bugs found in this Release Candidate to:

pgsql-bugs@postgresql.org

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664


#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#2)
Re: SUSE port (was [ANNOUNCE] PostgreSQL 8.0.0 Release Candidate 5)

Simon Riggs <simon@2ndquadrant.com> writes:

Not sure what is going on here: why is SUSE not listed on the supported
platforms list? (still)

I haven't seen any reports of passes on SUSE. I have zero doubt that PG
works on SUSE, since it's pretty much exactly like every other Linux,
but there have been no specific reports on the lists AFAIR.

...is it because Reinhard seems resistant (after private conversation)
to the idea of submitting a formal port report via HACKERS, like
everybody else?

See above.

...or is it because his postings to ANNOUNCE about the SUSE port have
gone unnoticed by those who compile the supported platforms list?

If he insists on posting such routine stuff to pgsql-announce, he should
not be too surprised that his postings do not get approved. That isn't
the correct forum. We don't peruse the New York Times classified ads
for such reports, either ...

regards, tom lane

#4Reinhard Max
max@suse.de
In reply to: Simon Riggs (#2)
Re: SUSE port (was [ANNOUNCE] PostgreSQL 8.0.0 Release Candidate 5)

Simon,

On Wed, 12 Jan 2005 at 08:23, Simon Riggs wrote:

Not sure what is going on here: why is SUSE not listed on the supported
platforms list? (still)

...is it because Reinhard seems resistant

why do you think so?

(after private conversation) to the idea of submitting a formal port
report via HACKERS, like everybody else?

I answered you that I will do it, but last week was a short week for
me, and this Monday I had an email from Peter Eisentraut telling me
that he will add the SUSE ports to the list, so I didn't see a need to
send a report in addition.

cu
Reinhard

#5Reinhard Max
max@suse.de
In reply to: Tom Lane (#3)
Re: SUSE port (was [ANNOUNCE] PostgreSQL 8.0.0 Release

Tom,

On Wed, 12 Jan 2005 at 03:53, Tom Lane wrote:

...or is it because his postings to ANNOUNCE about the SUSE port
have gone unnoticed by those who compile the supported platforms
list?

If he insists on posting such routine stuff to pgsql-announce, he
should not be too surprised that his postings do not get approved.
That isn't the correct forum. We don't peruse the New York Times
classified ads for such reports, either ...

no need to be rude to me for posting one single email to ANNOUNCE
after years of providing the SUSE RPMs silently. There were other
posts about RPM builds on ANNOUNCE, so I thought it would be the right
place to announce mine as well.

BTW, what would be the correct forum to make PostgreSQL users aware of
the appearance of new RPMs, if it is not ANNOUNCE? They are certainly
not expected to be reading the HACKERS list.

cu
Reinhard

#6Simon Riggs
simon@2ndquadrant.com
In reply to: Reinhard Max (#4)
Re: Re: SUSE port (was [ANNOUNCE] PostgreSQL 8.0.0 Release Candidate 5)

Reinhard Max <max@suse.de> wrote on 12.01.2005, 11:17:52:

On Wed, 12 Jan 2005 at 08:23, Simon Riggs wrote:

Not sure what is going on here: why is SUSE not listed on the supported
platforms list? (still)

...is it because Reinhard seems resistant

why do you think so?

(after private conversation) to the idea of submitting a formal port
report via HACKERS, like everybody else?

I answered you that I will do it, but last week was a short week for
me, and this Monday I had an email from Peter Eisentraut telling me
that he will add the SUSE ports to the list, so I didn't see a need to
send a report in addition.

Forgive my admittedly blunt tactics - I want to see SUSE on the list
before we release, and time was short - that's all. If it wasn't for
RC5, no port report would be listed in the docs in the final
release.... It's better to have a SUSE contact support the port than
for me to report it.

I should note here also that SGI have replied "No" to my request for
help (or access) with porting PostgreSQL onto the latest version of
IRIX...

Anyone got access to an HP/UX development system?

Best Regards, Simon Riggs

#7Simon Riggs
simon@2ndquadrant.com
In reply to: Simon Riggs (#6)
Re: Re: SUSE port (was [ANNOUNCE] PostgreSQL 8.0.0 Release Candidate 5)

Reinhard Max <max@suse.de> wrote on 12.01.2005, 11:38:55:

On Wed, 12 Jan 2005 at 03:53, Tom Lane wrote:

...or is it because his postings to ANNOUNCE about the SUSE port
have gone unnoticed by those who compile the supported platforms
list?

If he insists on posting such routine stuff to pgsql-announce, he
should not be too surprised that his postings do not get approved.
That isn't the correct forum. We don't peruse the New York Times
classified ads for such reports, either ...

no need to be rude to me for posting one single email to ANNOUNCE
after years of providing the SUSE RPMs silently. There were other
posts about RPM builds on ANNOUNCE, so I thought it would be the right
place to announce mine as well.

Please no more. I started this, so I apologise now to both of you, and
to the list and hope this ends here.

We all look forward to SUSE being a listed port.

Best Regards, Simon Riggs

#8Peter Eisentraut
peter_e@gmx.net
In reply to: Simon Riggs (#2)
Re: SUSE port (was [ANNOUNCE] PostgreSQL 8.0.0 Release Candidate 5)

Simon Riggs wrote:

Not sure what is going on here: why is SUSE not listed on the
supported platforms list? (still)

RC5 contains:

SUSE Linux 9.1    x86    8.0.0    Peter Eisentraut (<peter_e@gmx.net>), 2005-01-10

In the meantime I have received confirmation from Reinhard Max that his
test methods match our requirements, so the list will be completed with
the other platforms he reported in due time.

I'm sorry, but I won't just add "it works" notices if it's not clear
what kind of testing was done.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

#9Peter Eisentraut
peter_e@gmx.net
In reply to: Simon Riggs (#6)
Re: Re: Re: SUSE port (was [ANNOUNCE] PostgreSQL 8.0.0 Release Candidate 5)

simon@2ndquadrant.com wrote:

I should note here also that SGI have replied "No" to my request for
help (or access) with porting PostgreSQL onto the latest version of
IRIX...

Another year, same result...

Anyone got access to an HP/UX development system?

Check out HP's testdrive program. Many of the port reports for 7.4 were
done there, but I was too busy this time around to do that.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

#10Reinhard Max
max@suse.de
In reply to: Peter Eisentraut (#8)
Re: SUSE port (was [ANNOUNCE] PostgreSQL 8.0.0 Release

On Wed, 12 Jan 2005 at 16:29, Peter Eisentraut wrote:

In the meantime I have received confirmation from Reinhard Max that
his test methods match our requirements, so the list will be
completed with the other platforms he reported in due time.

Today I've updated the RPMs on the FTP server to RC5, which implicitly
means that PostgreSQL passes the regression tests on the provided
platforms.
ftp://ftp.suse.com/pub/projects/postgresql/postgresql-8.0.0rc5

I have also successfully compiled and tested RC5 (but not yet created
RPMs) on the following platforms:

8.1-i386
8.2-i386
sles8-i386
sles8-ia64
sles8-ppc
sles8-ppc64
sles8-s390
sles8-s390x

The only failure I have to report is sles8-x86_64, where I am getting
segfaults from psql during the regression tests. I am still
investigating what's going on there....

cu
Reinhard

#11Reinhard Max
max@suse.de
In reply to: Reinhard Max (#10)
Re: SUSE port (was [ANNOUNCE] PostgreSQL 8.0.0 Release

On Wed, 12 Jan 2005 at 17:29, Reinhard Max wrote:

The only failure I have to report is sles8-x86_64, where I am
getting segfaults from psql during the regression tests.

The segfault happens in a call to snprintf somewhere in libpq's kerberos5
code. When I leave out --with-krb5, it compiles and passes the test
suite without problems.

I am still not sure whether the kerberos library, glibc, or PostgreSQL
is to blame, or if it's a combination of bugs in these components that
triggers the segfault.

More details to follow...

cu
Reinhard

#12Jonah H. Harris
jharris@tvi.edu
In reply to: Reinhard Max (#11)
Much Ado About COUNT(*)

Tom, Bruce, and others involved in this recurring TODO discussion...

First, let me start by saying that I understand this has been discussed
many times before; however, I'd like to see what the current state of
affairs is regarding the possibility of using a unique index scan to
speed up the COUNT aggregate.

A few of my customers (some familiar with Oracle) are confused by the
amount of time it takes PostgreSQL to come up with the result and are
hesitating to use it because they think it's too slow. I've tried to
explain to them why it is slow, but in doing so I've come to see that
it may be worth working on.

I've reviewed the many messages regarding COUNT(*) and have looked
through some of the source (8.0-RC4) and have arrived at the following
questions:

1. Is there any answer to Bruce's last statement in the thread, "Re:
[PERFORM] COUNT(*) again (was Re: Index/Function organized..."
(http://archives.postgresql.org/pgsql-hackers/2003-10/msg00245.php)

2. What do you think about a separate plan type such as IndexOnlyScan?
Good/stupid/what is he on?

3. Assuming that Bruce's aforementioned statement is correct, what
hidden performance bottlenecks might there be?

4. What is the consensus of updating a per-relation value containing
the row counts?

Though not exactly like PostgreSQL, Oracle uses MVCC and performs an
index scan on a unique value for all unqualified counts. Admittedly,
counts are faster than they used to be, but this is always a complaint
I hear from open source users and professionals alike.

I've been pretty busy, and I still need to get the user/group quota
working with 8.0 and forward the diffs to you all, but I would be
willing to work on speeding up the count(*) if you guys give me
your input.

As always, keep up the good work!

Respectfully,

Jonah H. Harris, Senior Web Administrator
Albuquerque TVI
505.224.4814

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jonah H. Harris (#12)
Re: Much Ado About COUNT(*)

"Jonah H. Harris" <jharris@tvi.edu> writes:

Tom, Bruce, and others involved in this recurring TODO discussion...
First, let me start by saying that I understand this has been discussed
many times before; however, I'd like to see what the current state of
affairs is regarding the possibility of using a unique index scan to
speed up the COUNT aggregate.

It's not happening, because no one has come up with a workable proposal.
In particular, we're not willing to slow down every other operation in
order to make COUNT-*-with-no-WHERE-clause faster.

regards, tom lane

#14Reinhard Max
max@suse.de
In reply to: Reinhard Max (#11)
segfault caused by heimdal (was: SUSE port)

On Wed, 12 Jan 2005 at 18:20, Reinhard Max wrote:

I am still not sure whether the kerberos library, glibc, or
PostgreSQL is to blame, or if it's a combination of bugs in these
components that triggers the segfault.

The problem is that the heimdal implementation of kerberos5 used on
sles8 needs an extra include statement for com_err.h in
src/interfaces/libpq/fe-auth.c to get the prototype for
error_message(), while on newer SUSE releases using the MIT Kerberos5
implementation this prototype is provided by krb5.h itself.

So I suspect this bug might hit everyone using heimdal, but it only
gets triggered when one of the calls to kerberos in
src/interfaces/libpq/fe-auth.c returns with an error.

I am still not sure why the crash only happened on x86_64.

cu
Reinhard

#15Merlin Moncure
merlin.moncure@rcsonline.com
In reply to: Tom Lane (#13)
Re: Much Ado About COUNT(*)

Tom, Bruce, and others involved in this recurring TODO discussion...

First, let me start by saying that I understand this has been

discussed

many times before; however, I'd like to see what the current state of
affairs is regarding the possibility of using a unique index scan to
speed up the COUNT aggregate.

To sum up:
1. There are good technical reasons why not to do this. The pg
aggregate system is very elegant...not worth compromising it for a
specific case.
2. postgresql can do many things faster than oracle. If you prefer the
way oracle behaves, use oracle.
3. workaround #1: just run analyze once in a while (you should do that
anyways) and query pg_class for the # of tuples in a relation (see the
sketch below).
4. workaround #2: rig up a materialized view and query that. This will
be faster than what oracle does, btw, at the price of some coherency.
5. understand that count(*) from t, although frequently used, is of
dubious value in the general sense. Sooner or later someone will
optimize this, but in light of the currently available workarounds it
doesn't seem that important.
6. for large tables, you can get a pretty accurate count by doing:
select count(*) * 10 from t where random() > .9;
on my setup, this shaved about 15% off of the counting time...YMMV.
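A minimal sketch of that pg_class approach (assuming a hypothetical table
named t and a reasonably recent ANALYZE; reltuples is only an estimate,
not an exact, transaction-consistent count):

-- approximate row count straight from the planner's statistics
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 't';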

Merlin

#16Jonah H. Harris
jharris@tvi.edu
In reply to: Tom Lane (#13)
Re: Much Ado About COUNT(*)

Tom,

Thank you for your prompt response and I understand your statement
completely.

My thinking is that we may be able to implement index usage for not only
unqualified counts, but also on any query that can be satisfied by the
index itself. Index usage seems to be a feature that could speed up
PostgreSQL for many people. I'm working on a project right now that
could actually take advantage of it.

Looking at the message boards, there is significant interest in the
COUNT(*) aspect. However, rather than solely address the COUNT(*) TODO
item, why not fix it and add additional functionality found in
commercial databases as well? I believe Oracle has had this feature
since 7.3 and I know people take advantage of it.

I understand that you guys have a lot more important stuff to do than
work on something like this. Unlike other people posting the request and
whining about the speed, I'm offering to take it on and fix it.

Take this message as my willingness to propose and implement this
feature. Any details, pitfalls, or suggestions are appreciated.

Thanks again!

-Jonah

Tom Lane wrote:


"Jonah H. Harris" <jharris@tvi.edu> writes:

Tom, Bruce, and others involved in this recurring TODO discussion...
First, let me start by saying that I understand this has been discussed
many times before; however, I'd like to see what the current state of
affairs is regarding the possibility of using a unique index scan to
speed up the COUNT aggregate.

It's not happening, because no one has come up with a workable proposal.
In particular, we're not willing to slow down every other operation in
order to make COUNT-*-with-no-WHERE-clause faster.

regards, tom lane

#17Reinhard Max
max@suse.de
In reply to: Reinhard Max (#14)
1 attachment(s)
Re: [HACKERS] segfault caused by heimdal (was: SUSE port)

Sorry for following up to myself once more...

On Wed, 12 Jan 2005 at 19:36, Reinhard Max wrote:

The problem is that the heimdal implementation of kerberos5 used on
sles8 needs an extra include statement for com_err.h in
src/interfaces/libpq/fe-auth.c to get the prototype for
error_message(), while on newer SUSE releases using the MIT
Kerberos5 implementation this prototype is provided by krb5.h
itself.

after finding and reading the thread on HACKERS about com_err.h from
last December, I think either configure should check whether including
krb5.h is sufficient for getting the prototype of error_message(), or
a conditional include of com_err.h should be added to
src/interfaces/libpq/fe-auth.c.

A proposed patch to achieve the latter is attached to this mail.

Either way will lead to a build-time error when error_message() isn't
declared or com_err.h can't be found, which is better than the current
situation where only a warning about a missing prototype is issued but
compilation continues, resulting in a broken libpq.

cu
Reinhard

Attachments:

postgresql-krb5.patch (text/plain; charset=US-ASCII)
--- src/interfaces/libpq/fe-auth.c
+++ src/interfaces/libpq/fe-auth.c
@@ -244,6 +244,11 @@
 
 #include <krb5.h>
 
+#if !defined(__COM_ERR_H) && !defined(__COM_ERR_H__)
+/* if krb5.h didn't include it already */
+#include <com_err.h>
+#endif
+
 /*
  * pg_an_to_ln -- return the local name corresponding to an authentication
  *				  name
#18Greg Stark
gsstark@mit.edu
In reply to: Jonah H. Harris (#16)
Re: Much Ado About COUNT(*)

"Jonah H. Harris" <jharris@tvi.edu> writes:

Looking at the message boards, there is significant interest in the COUNT(*)
aspect. However, rather than solely address the COUNT(*) TODO item, why not fix
it and add additional functionality found in commercial databases as well? I
believe Oracle has had this feature since 7.3 and I know people take advantage
of it.

I think part of the problem is that there's a bunch of features related to
these types of queries and the lines between them blur.

You seem to be talking about putting visibility information inside indexes
so that index-only plans can be performed. But you're also talking about queries
like "select count(*) from foo" with no where clauses. Such a query wouldn't
be helped by index-only scans.

Perhaps you're thinking about caching the total number of records in a global
piece of state like a materialized view? That would be a nice feature but I
think it should be done as a general materialized view implementation, not a
special case solution for just this one query.

Perhaps you're thinking of the min/max problem of being able to use indexes to
pick out just the tuples satisfying the min/max constraint. That seems to me
to be one of the more tractable problems in this area but it would still
require lots of work.

I suggest you post a specific query you find is slow. Then discuss how you
think it ought to be executed and why.

--
greg

#19Kurt Roeckx
In reply to: Reinhard Max (#14)
Re: segfault caused by heimdal (was: SUSE port)

On Wed, Jan 12, 2005 at 07:36:52PM +0100, Reinhard Max wrote:

The problem is that the heimdal implementation of kerberos5 used on
sles8 needs an extra include statement for com_err.h in
src/interfaces/libpq/fe-auth.c to get the prototype for
error_message(), while on newer SUSE releases using the MIT Kerberos5
implementation this prototype is provided by krb5.h itself.

[...]

I am still not sure why the crash only happened on x86_64.

This is because the proper prototype is:
extern char const *error_message (long);

And C automatically generates a prototype with an int instead.

On 32-bit platforms this usually isn't a problem since int and long
are usually both 32 bits, but on x86_64 a long is 64 bits while an int
is only 32.

Kurt

#20Reinhard Max
max@suse.de
In reply to: Kurt Roeckx (#19)
Re: segfault caused by heimdal (was: SUSE port)

On Wed, 12 Jan 2005 at 20:28, Kurt Roeckx wrote:

This is because the proper prototype is:
extern char const *error_message (long);

And C automatically generates a prototype with an int instead.

On 32-bit platforms this usually isn't a problem since int and long
are usually both 32 bits, but on x86_64 a long is 64 bits while an int
is only 32.

It's actually not the long argument, but the returned pointer that
caused the segfault.

This explains why it didn't crash on i386, but not why it also didn't
crash on IA64, ppc64, and s390x, which are also LP64 platforms.

cu
Reinhard

#21Jonah H. Harris
jharris@tvi.edu
In reply to: Greg Stark (#18)
Re: Much Ado About COUNT(*)

Greg Stark wrote:

I think part of the problem is that there's a bunch of features related to
these types of queries and the lines between them blur.

You seem to be talking about putting visibility information inside indexes
so that index-only plans can be performed. But you're also talking about queries
like "select count(*) from foo" with no where clauses. Such a query wouldn't
be helped by index-only scans.

Perhaps you're thinking about caching the total number of records in a global
piece of state like a materialized view? That would be a nice feature but I
think it should be done as a general materialized view implementation, not a
special case solution for just this one query.

Perhaps you're thinking of the min/max problem of being able to use indexes to
pick out just the tuples satisfying the min/max constraint. That seems to me
to be one of the more tractable problems in this area but it would still
require lots of work.

I suggest you post a specific query you find is slow. Then discuss how you
think it ought to be executed and why.

You are correct, I am proposing to add visibility to the indexes.

As for unqualified counts, I believe that they could take advantage of
an index-only scan as it requires much less I/O to perform an index scan
than a sequential scan on large tables.

Min/Max would also take advantage of index only scans but say, for
example, that someone has the following:

Relation SOME_USERS
user_id BIGINT PK
user_nm varchar(32) UNIQUE INDEX
some_other_attributes...

If an application needs the user names, it would run SELECT user_nm FROM
SOME_USERS... in the current implementation this would require a
sequential scan. On a relation which contains 1M+ tuples, this requires
either a lot of I/O or a lot of cache. An index scan would immensely
speed up this query.
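A minimal sketch of that example in SQL (the extra column is a hypothetical
stand-in for "some_other_attributes"):

CREATE TABLE some_users (
    user_id    bigint PRIMARY KEY,
    user_nm    varchar(32) UNIQUE,  -- backed by a unique index
    other_attr text                 -- stand-in for the other attributes
);

-- only the indexed column is requested, yet without index-only scans the
-- heap must still be visited for every row's visibility check
SELECT user_nm FROM some_users;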

#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jonah H. Harris (#16)
Re: Much Ado About COUNT(*)

"Jonah H. Harris" <jharris@tvi.edu> writes:

My thinking is that we may be able to implement index usage for not only
unqualified counts, but also on any query that can be satisfied by the
index itself.

The fundamental problem is that you can't do it without adding at least
16 bytes, probably 20, to the size of an index tuple header. That would
double the physical size of an index on a simple column (eg an integer
or timestamp). The extra I/O costs and extra maintenance costs are
unattractive to say the least. And it takes away some of the
justification for the whole thing, which is that reading an index is
much cheaper than reading the main table. That's only true if the index
is much smaller than the main table ...

regards, tom lane

#23Merlin Moncure
merlin.moncure@rcsonline.com
In reply to: Tom Lane (#22)
Re: Much Ado About COUNT(*)

Greg Stark wrote:

I think part of the problem is that there's a bunch of features related
to these types of queries and the lines between them blur.

You seem to be talking about putting visibility information inside
indexes so that index-only plans can be performed. But you're also
talking about queries like "select count(*) from foo" with no where
clauses. Such a query wouldn't be helped by index-only scans.

Perhaps you're thinking about caching the total number of records in a
global piece of state like a materialized view? That would be a nice
feature but I think it should be done as a general materialized view
implementation, not a special case solution for just this one query.

Perhaps you're thinking of the min/max problem of being able to use
indexes to pick out just the tuples satisfying the min/max constraint.
That seems to me to be one of the more tractable problems in this area
but it would still require lots of work.

I suggest you post a specific query you find is slow. Then discuss how
you think it ought to be executed and why.

You are correct, I am proposing to add visibility to the indexes.

As for unqualified counts, I believe that they could take advantage of
an index-only scan as it requires much less I/O to perform an index
scan than a sequential scan on large tables.

I agree with Greg...I think the way to approach this is a general
materialized view solution. This would solve a broad class of tricky
problems including count() and count(*)...you get to choose between the
pay now/pay later tradeoff, etc.

Merlin

#24Jonah H. Harris
jharris@tvi.edu
In reply to: Tom Lane (#22)
Re: Much Ado About COUNT(*)

Tom Lane wrote:

The fundamental problem is that you can't do it without adding at least
16 bytes, probably 20, to the size of an index tuple header. That would
double the physical size of an index on a simple column (eg an integer
or timestamp). The extra I/O costs and extra maintenance costs are
unattractive to say the least. And it takes away some of the
justification for the whole thing, which is that reading an index is
much cheaper than reading the main table. That's only true if the index
is much smaller than the main table ...

regards, tom lane

I recognize the added cost of implementing index-only scans. As storage
is relatively cheap these days, everyone I know is more concerned about
faster access to data. Similarly, it would still be faster to scan the
indexes than to perform a sequential scan over the entire relation for
this case. I also acknowledge that it would have a negative impact on
indexes where this type of access isn't required, as you suggested,
though that is more than likely not the case. I just wonder which option
more people would be happier with and whether the added 16-20 bytes
would be extremely noticeable considering most 1-3 year old hardware.

#25Dann Corbit
DCorbit@connx.com
In reply to: Jonah H. Harris (#24)
Re: Much Ado About COUNT(*)

A notion for indices that are not unique... (won't help much on select
count(*) but might be helpful for other types of query optimization)

Put a count in the index for each distinct type.
In the worst case, the index is actually unique and you have 8 wasted
bytes per index entry and all the entries are in the leaves (perhaps it
could be an OPTION for some tables). I don't know enough about the
structure of PostgreSQL's indexes to know if my suggestion is pure
hogwash, so don't laugh too hard if it is pure stupidity.

The most practical value of SELECT COUNT(*) is for updating statistics
(and looking good in phony-baloney benchmarks). But the statistics only
need to be updated when you vacuum, so it hardly seems a crucial issue
to me.

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane
Sent: Wednesday, January 12, 2005 11:42 AM
To: Jonah H. Harris
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Much Ado About COUNT(*)

"Jonah H. Harris" <jharris@tvi.edu> writes:

My thinking is that we may be able to implement index usage for not only
unqualified counts, but also on any query that can be satisfied by the
index itself.

The fundamental problem is that you can't do it without adding at least
16 bytes, probably 20, to the size of an index tuple header. That would
double the physical size of an index on a simple column (eg an integer
or timestamp). The extra I/O costs and extra maintenance costs are
unattractive to say the least. And it takes away some of the
justification for the whole thing, which is that reading an index is
much cheaper than reading the main table. That's only true if the index
is much smaller than the main table ...

regards, tom lane


#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Reinhard Max (#17)
Re: [HACKERS] segfault caused by heimdal (was: SUSE port)

Reinhard Max <max@suse.de> writes:

--- src/interfaces/libpq/fe-auth.c
+++ src/interfaces/libpq/fe-auth.c
@@ -244,6 +244,11 @@

#include <krb5.h>

+#if !defined(__COM_ERR_H) && !defined(__COM_ERR_H__)
+/* if krb5.h didn't include it already */
+#include <com_err.h>
+#endif
+
/*
* pg_an_to_ln -- return the local name corresponding to an authentication
*				  name

That looks like a reasonable fix, but isn't it needed in
backend/libpq/auth.c as well?

regards, tom lane

#27Greg Stark
gsstark@mit.edu
In reply to: Jonah H. Harris (#21)
Re: Much Ado About COUNT(*)

"Jonah H. Harris" <jharris@tvi.edu> writes:

You are correct, I am proposing to add visibility to the indexes.

Then I think the only way you'll get any support is if it's an option,
since it would incur a performance penalty on updates and deletes.

As for unqualified counts, I believe that they could take advantage of an
index-only scan as it requires much less I/O to perform an index scan than a
sequential scan on large tables.

No, sequential scans require slightly more i/o than index scans. More
importantly they require random access i/o instead of sequential i/o which is
much slower.

Though this depends. If the tuple is very wide then the index might be faster
to scan since it would only contain the data from the fields being indexed.

This brings to mind another approach. It might be handy to split the heap for
a table into multiple heaps. The visibility information would only be in one
of the heaps. This would be a big win if many of the fields were rarely used,
especially if they're rarely used by sequential scans.

Relation SOME_USERS
user_id BIGINT PK
user_nm varchar(32) UNIQUE INDEX
some_other_attributes...

What's with the fetish with unique indexes? None of this is any different for
unique indexes versus non-unique indexes.

--
greg

#28Rod Taylor
pg@rbt.ca
In reply to: Jonah H. Harris (#24)
Re: Much Ado About COUNT(*)

On Wed, 2005-01-12 at 12:52 -0700, Jonah H. Harris wrote:

Tom Lane wrote:

The fundamental problem is that you can't do it without adding at least
16 bytes, probably 20, to the size of an index tuple header. That would
double the physical size of an index on a simple column (eg an integer
or timestamp). The extra I/O costs and extra maintenance costs are
unattractive to say the least. And it takes away some of the
justification for the whole thing, which is that reading an index is
much cheaper than reading the main table. That's only true if the index
is much smaller than the main table ...

I recognize the added cost of implementing index-only scans. As storage
is relatively cheap these days, everyone I know is more concerned about
faster access to data. Similarly, it would still be faster to scan the
indexes than to perform a sequential scan over the entire relation for
this case. I also acknowledge that it would have a negative impact on
indexes where this type of access isn't required, as you suggested,
though that is more than likely not the case. I just wonder which option
more people would be happier with and whether the added 16-20 bytes
would be extremely noticeable considering most 1-3 year old hardware.

I'm very much against this. After some quick math, my database would
grow by about 40GB if this was done. Storage isn't that cheap when you
include the hot-backup master, various slaves, RAM for caching of this
additional index space, backup storage unit on the SAN, tape backups,
additional spindles required to maintain same performance due to
increased IO because I don't have very many queries which would receive an
advantage (big one for me -- we started buying spindles for performance
a long time ago), etc.

Make it a new index type if you like, but don't impose any new
performance constraints on folks who have little to no advantage from
the above proposal.

#29Reinhard Max
max@suse.de
In reply to: Tom Lane (#26)
Re: [HACKERS] segfault caused by heimdal (was: SUSE port)

On Wed, 12 Jan 2005 at 14:59, Tom Lane wrote:

That looks like a reasonable fix, but isn't it needed in
backend/libpq/auth.c as well?

Yes, indeed:

auth.c: In function `pg_krb5_init':
auth.c:202: warning: implicit declaration of function `com_err'

cu
Reinhard

#30Andrew Dunstan
andrew@dunslane.net
In reply to: Jonah H. Harris (#24)
Re: Much Ado About COUNT(*)

Jonah H. Harris said:

Tom Lane wrote:

The fundamental problem is that you can't do it without adding at least
16 bytes, probably 20, to the size of an index tuple header. That
would double the physical size of an index on a simple column (eg an
integer or timestamp). The extra I/O costs and extra maintenance costs
are unattractive to say the least. And it takes away some of the
justification for the whole thing, which is that reading an index is
much cheaper than reading the main table. That's only true if the
index is much smaller than the main table ...

I recognize the added cost of implementing index-only scans. As storage
is relatively cheap these days, everyone I know is more concerned about
faster access to data. Similarly, it would still be faster to scan the
indexes than to perform a sequential scan over the entire relation for
this case. I also acknowledge that it would have a negative impact on
indexes where this type of access isn't required, as you suggested,
though that is more than likely not the case. I just wonder which option
more people would be happier with and whether the added 16-20 bytes
would be extremely noticeable considering most 1-3 year old hardware.

Monetary cost is not the issue - cost in time is the issue.

cheers

andrew

#31Jeff Davis
jdavis-pgsql@empires.org
In reply to: Greg Stark (#27)
Re: Much Ado About COUNT(*)

No, sequential scans require slightly more i/o than index scans. More
importantly they require random access i/o instead of sequential i/o which is
much slower.

Just to clear it up, I think what you meant was the index requires
random i/o, not the table. And the word "slightly" depends on the size
of the table I suppose. And of course it also depends on how many tuples
you actually need to retrieve (in this case we're talking about
retrieving all the tuples regardless).

Though this depends. If the tuple is very wide then the index might be faster
to scan since it would only contain the data from the fields being indexed.

That, and it seems strange on the surface to visit every entry in an
index, since normally indexes are used to find only a small fraction of
the tuples.

This brings to mind another approach. It might be handy to split the heap for
a table into multiple heaps. The visibility information would only be in one
of the heaps. This would be a big win if many of the fields were rarely used,
especially if they're rarely used by sequential scans.

Except then the two heaps would have to be joined somehow for every
operation. It makes sense sometimes (if you have a very wide table) to
split off the rarely-accessed attributes into a separate table to be
joined one-to-one when those attributes are needed. To have the system
do that automatically would create problems if the attributes that are
split off are frequently accessed, right?

Perhaps you could optionally create a separate copy of the same tuple
visibility information linked in a way similar to an index. It still
seems like you gain very little, and only in some very rare situation
that I've never encountered (I've never had the need to do frequent
unqualified count()s at the expense of other operations).

Now, it seems like it might make a little more sense to use an index for
min()/max(), but that's a different story.

Regards,
Jeff Davis

#32Jonah H. Harris
jharris@tvi.edu
In reply to: Andrew Dunstan (#30)
Re: Much Ado About COUNT(*)

Andrew Dunstan wrote:

Monetary cost is not the issue - cost in time is the issue.

cheers

andrew

We seem to be in agreement. I'm looking for faster/smarter access to
data, not the monetary cost of doing so. Isn't it faster/smarter to
satisfy a query with the index rather than sequentially scanning an
entire relation if it is possible?

Replying to the list as a whole:

If this is such a bad idea, why do other database systems use it? As a
businessperson myself, it doesn't seem logical to me that commercial
database companies would spend money on implementing this feature if it
wouldn't be used. Remember guys, I'm just trying to help.

#33Jonah H. Harris
jharris@tvi.edu
In reply to: Rod Taylor (#28)
Re: Much Ado About COUNT(*)

Rod Taylor wrote:

grow by about 40GB if this was done. Storage isn't that cheap when you
include the hot-backup master, various slaves, RAM for caching of this
additional index space, backup storage unit on the SAN, tape backups,
additional spindles required to maintain same performance due to
increased IO because I don't have very many queries which would receive an
advantage (big one for me -- we started buying spindles for performance
a long time ago), etc.

Thanks for the calculation and example. This would be a hefty amount of
overhead if none of your queries would benefit from this change.

Make it a new index type if you like, but don't impose any new
performance constraints on folks who have little to no advantage from
the above proposal.

I agree with you that some people may not see any benefit from this and
that it may look worse performance/storage-wise. I've considered this
route, but it seems like more of a workaround than a solution.

#34Jeff Davis
jdavis-pgsql@empires.org
In reply to: Jonah H. Harris (#24)
Re: Much Ado About COUNT(*)

I recognize the added cost of implementing index-only scans. As storage
is relatively cheap these days, everyone I know is more concerned about
faster access to data. Similarly, it would still be faster to scan the
indexes than to perform a sequential scan over the entire relation for
this case. I also acknowledge that it would have a negative impact on
indexes where this type of access isn't required, as you suggested,
though that is more than likely not the case. I just wonder which option
more people would be happier with and whether the added 16-20 bytes
would be extremely noticeable considering most 1-3 year old hardware.

I think perhaps you missed the point: it's not about price. If an index
takes up more space, it will also take more time to read that index.
16-20 bytes per index entry is way too much extra overhead for most
people, no matter what hardware they have. That overhead also puts more
pressure on what is already the bottleneck for almost every DB:
i/o bandwidth.

The cost to count the tuples is the cost to read that visibility
information for each tuple in the table. A seqscan is the most efficient
way to do that since it's sequential i/o, rather than random i/o. The
only reason the word "index" even comes up is because it is inefficient
to retrieve a lot of extra attributes you don't need from a table.

You might be able to pack that visibility information a little bit more
densely in an index than a table, assuming that the table has more than
a couple columns. But if you shoehorn the visibility information into an
index, you destroy much of the value of an index to most people, who
require the index to be compact to be efficient.

An index isn't really the place for something when all you really want
to do is a sequential scan over a smaller amount of data (so that the
visibility information is more dense). Make a narrow table, and seqscan
over that. Then, if you need more attributes in the table, just do a
one-to-one join with a separate table.
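A minimal sketch of that narrow-table idea (all names hypothetical):

-- wide table holds the bulky attributes
CREATE TABLE items_wide (
    item_id bigint PRIMARY KEY,
    payload text
);

-- narrow companion table: just the key, so scanning its visibility
-- information stays cheap
CREATE TABLE items_narrow (
    item_id bigint PRIMARY KEY REFERENCES items_wide (item_id)
);

-- the cheap seqscan count
SELECT count(*) FROM items_narrow;

-- one-to-one join back when the other attributes are needed
SELECT w.payload
FROM items_narrow n JOIN items_wide w USING (item_id);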

Regards,
Jeff Davis

#35Jon Jensen
jon@endpoint.com
In reply to: Jonah H. Harris (#32)
Re: Much Ado About COUNT(*)

On Wed, 12 Jan 2005, Jonah H. Harris wrote:

Andrew Dunstan wrote:

Monetary cost is not the issue - cost in time is the issue.

We seem to be in agreement. I'm looking for faster/smarter access to data,
not the monetary cost of doing so. Isn't it faster/smarter to satisfy a
query with the index rather than sequentially scanning an entire relation if
it is possible?

Replying to the list as a whole:

If this is such a bad idea, why do other database systems use it? As a
businessperson myself, it doesn't seem logical to me that commercial database
companies would spend money on implementing this feature if it wouldn't be
used. Remember guys, I'm just trying to help.

If you're willing to do the work, and have the motivation, probably the
best thing to do is just do it. Then you can use empirical measurements of
the effect on disk space, speed of various operations, etc. to discuss the
merits/demerits of your particular implementation.

Then others don't need to feel so threatened by the potential change.
Either it'll be (1) an obvious win, or (2) a mixed bag, where allowing the
new way to be specified as an option is a possibility, or (3) you'll have
to go back to the drawing board if it's an obvious loss.

This problem's been talked about a lot, but seeing some code and metrics
from someone with a personal interest in solving it would really be
progress IMHO.

Jon

--
Jon Jensen
End Point Corporation
http://www.endpoint.com/
Software development with Interchange, Perl, PostgreSQL, Apache, Linux, ...

#36Jonah H. Harris
jharris@tvi.edu
In reply to: Jon Jensen (#35)
Re: Much Ado About COUNT(*)

Jon Jensen wrote:

If you're willing to do the work, and have the motivation, probably
the best thing to do is just do it. Then you can use empirical
measurements of the effect on disk space, speed of various operations,
etc. to discuss the merits/demerits of your particular implementation.

Then others don't need to feel so threatened by the potential change.
Either it'll be (1) an obvious win, or (2) a mixed bag, where allowing
the new way to be specified as an option is a possibility, or (3)
you'll have to go back to the drawing board if it's an obvious loss.

This problem's been talked about a lot, but seeing some code and
metrics from someone with a personal interest in solving it would
really be progress IMHO.

Jon,

This is pretty much where I'm coming from. I'm looking at working on
this and I'd rather discuss suggestions for how to implement this than
whether we should have it, etc. If it turns out better, great; if not,
then I just wasted my time. I really do appreciate everyone's comments
and suggestions.

Thanks!

-Jonah

#37Jeff Davis
jdavis-pgsql@empires.org
In reply to: Jonah H. Harris (#32)
Re: Much Ado About COUNT(*)

We seem to be in agreement. I'm looking for faster/smarter access to
data, not the monetary cost of doing so. Isn't it faster/smarter to
satisfy a query with the index rather than sequentially scanning an
entire relation if it is possible?

You have to scan every tuple's visibility information no matter what.
Sequential i/o is fastest (per byte), so the most efficient possible
method is seqscan. Unfortunately for count(*), the tables also store
columns, which are really just clutter as you're moving through the
table looking for visibility information.

Indexes are designed for searches, not exhaustive retrieval of all
records. If you can store that visibility information in a separate
place so that it's not cluttered by unneeded attributes, that could be
helpful, but an index is not the place for that. If you store that in
the index, you are really imposing a new cost on yourself: the cost to
do random i/o as you're jumping around the index trying to access every
entry, plus the index metadata.

You could make a narrow table and join with the other attributes. That
might be a good place that wouldn't clutter up the visibility
information much (it would just need a primary key). A seqscan over that
would be quite efficient.

If this is such a bad idea, why do other database systems use it? As a
businessperson myself, it doesn't seem logical to me that commercial
database companies would spend money on implementing this feature if it
wouldn't be used. Remember guys, I'm just trying to help.

Some databases use an internal counter to count rows as they are
added/deleted. This does not give accurate results in a database that
supports ACID -- more specifically a database that implements the
"isolation" part of ACID. Two concurrent transactions, if the database
supports proper isolation, could have two different results for count(*)
and both would be correct. That makes all the "optimized count(*)"
databases really just give a close number, not the real number. If you
just want a close number, there are other ways of doing that in
PostgreSQL that people have already mentioned.

If you know of a database that supports proper isolation and also has a
faster count(*) I would be interested to know what it is. There may be a
way to do it without sacrificing in other areas, but I don't know what
it is. Does someone know exactly what oracle actually does?

Regards,
Jeff Davis

#38Bruno Wolff III
bruno@wolff.to
In reply to: Jonah H. Harris (#32)
Re: Much Ado About COUNT(*)

On Wed, Jan 12, 2005 at 13:42:58 -0700,
"Jonah H. Harris" <jharris@tvi.edu> wrote:

We seem to be in agreement. I'm looking for faster/smarter access to
data, not the monetary cost of doing so. Isn't it faster/smarter to
satisfy a query with the index rather than sequentially scanning an
entire relation if it is possible?

Not necessarily. Also note that Postgres will use an index scan for
count(*) if there is a relatively selective WHERE clause.
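A small hypothetical example of that selective case (assuming an index on
orders(order_date); the names are only illustrative):

-- selective predicate: the planner can drive the count from the index
SELECT count(*) FROM orders WHERE order_date >= '2005-01-01';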

Replying to the list as a whole:

If this is such a bad idea, why do other database systems use it? As a
businessperson myself, it doesn't seem logical to me that commercial
database companies would spend money on implementing this feature if it
wouldn't be used. Remember guys, I'm just trying to help.

Other databases use different ways of handling tuples that are only visible
to some concurrent transactions.

Postgres is also flexible enough that you can make your own materialized
view (using triggers) to handle count(*) if that makes sense for you.
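A minimal sketch of such a trigger-maintained count, assuming a
hypothetical table t and that the plpgsql language is installed; note
that every writer serializes on the single counter row, which is the
usual price of this approach:

-- counter table, seeded once from the real count
CREATE TABLE t_rowcount (n bigint NOT NULL);
INSERT INTO t_rowcount VALUES ((SELECT count(*) FROM t));

CREATE FUNCTION t_rowcount_trig() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE t_rowcount SET n = n + 1;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE t_rowcount SET n = n - 1;
    END IF;
    RETURN NULL;  -- AFTER trigger, return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER t_rowcount_maint AFTER INSERT OR DELETE ON t
    FOR EACH ROW EXECUTE PROCEDURE t_rowcount_trig();

-- then, instead of SELECT count(*) FROM t:
SELECT n FROM t_rowcount;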

#39Jonah H. Harris
jharris@tvi.edu
In reply to: Jeff Davis (#37)
Re: Much Ado About COUNT(*)

Jeff Davis wrote:

Does someone know exactly what oracle actually does?

some old info resides here, http://www.orsweb.com/techniques/fastfull.html

I'll try and find something more recent.

#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Reinhard Max (#29)
Re: [HACKERS] segfault caused by heimdal (was: SUSE port)

Reinhard Max <max@suse.de> writes:

On Wed, 12 Jan 2005 at 14:59, Tom Lane wrote:

That looks like a reasonable fix, but isn't it needed in
backend/libpq/auth.c as well?

Yes, indeed:
auth.c: In function `pg_krb5_init':
auth.c:202: warning: implicit declaration of function `com_err'

OK, patch applied in both files.

regards, tom lane

#41Marek Mosiewicz
marekmosiewicz@poczta.onet.pl
In reply to: Jonah H. Harris (#21)
Re: Much Ado About COUNT(*)

I agree with the last statement. count(*) is not the most important case.
The nicest thing about an index-only scan is when the index contains more
than one column. When there is a join among many tables and only one or a
few columns are taken from each table, it boosts the query incredibly.

For example, when you have a customer table with an (ID, NAME) index on it:

select c.name, i.* from customer c, invoice i where c.id=i.customer_id

then there is a HUGE difference. Without an index-only scan you need both
index I/O and random table access (assuming no full scan). With an
index-only scan you need only the index scan and can skip the expensive
random table I/O. It is a very simple but powerful optimization that in
many cases reduces the join expense of difficult queries. You effectively
get some kind of index-organized table (you use only the index, so in
fact it is an ordered table).

Selecting only a few columns is quite a common scenario in reporting.

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org]On Behalf Of Jonah H. Harris
Sent: Wednesday, January 12, 2005 8:36 PM
To: Greg Stark
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Much Ado About COUNT(*)

Greg Stark wrote:

I think part of the problem is that there's a bunch of features related to
these types of queries and the lines between them blur.

You seem to be talking about putting visibility information inside indexes
so that index-only plans can be performed. But you're also talking about
queries like "select count(*) from foo" with no where clauses. Such a query
wouldn't be helped by index-only scans.

Perhaps you're thinking about caching the total number of records in a
global piece of state like a materialized view? That would be a nice
feature but I think it should be done as a general materialized view
implementation, not a special case solution for just this one query.

Perhaps you're thinking of the min/max problem of being able to use indexes
to pick out just the tuples satisfying the min/max constraint. That seems
to me to be one of the more tractable problems in this area but it would
still require lots of work.

I suggest you post a specific query you find is slow. Then discuss how you
think it ought to be executed and why.

You are correct, I am proposing to add visibility to the indexes.

As for unqualified counts, I believe that they could take advantage of
an index-only scan as it requires much less I/O to perform an index scan
than a sequential scan on large tables.

Min/Max would also take advantage of index only scans but say, for
example, that someone has the following:

Relation SOME_USERS
user_id BIGINT PK
user_nm varchar(32) UNIQUE INDEX
some_other_attributes...

If an application needs the user names, it would run SELECT user_nm FROM
SOME_USERS... in the current implementation this would require a
sequential scan. On a relation which contains 1M+ tuples, this requires
either a lot of I/O or a lot of cache. An index scan would immensely
speed up this query.


#42Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Jeff Davis (#31)
Re: Much Ado About COUNT(*)

On Wed, Jan 12, 2005 at 12:41:38PM -0800, Jeff Davis wrote:

Except then the two heaps would have to be joined somehow for every
operation. It makes sense sometimes (if you have a very wide table) to
split off the rarely-accessed attributes into a separate table to be
joined one-to-one when those attributes are needed. To have the system
do that automatically would create problems if the attributes that are
split off are frequently accessed, right?

That mechanism exists right now, and it's called TOAST, dubbed the best
thing since sliced bread. We even have documentation for it, new as of
our latest RC:

http://developer.postgresql.org/docs/postgres/storage-toast.html

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"El d�a que dejes de cambiar dejar�s de vivir"

#43Simon Riggs
simon@2ndquadrant.com
In reply to: Rod Taylor (#28)
Re: Much Ado About COUNT(*)

On Wed, 2005-01-12 at 15:09 -0500, Rod Taylor wrote:

On Wed, 2005-01-12 at 12:52 -0700, Jonah H. Harris wrote:

Tom Lane wrote:

The fundamental problem is that you can't do it without adding at least
16 bytes, probably 20, to the size of an index tuple header. That would
double the physical size of an index on a simple column (eg an integer
or timestamp). The extra I/O costs and extra maintenance costs are
unattractive to say the least. And it takes away some of the
justification for the whole thing, which is that reading an index is
much cheaper than reading the main table. That's only true if the index
is much smaller than the main table ...

I recognize the added cost of implementing index-only scans. As storage
is relatively cheap these days, everyone I know is more concerned about
faster access to data. Similarly, it would still be faster to scan the
indexes than to perform a sequential scan over the entire relation for
this case. I also acknowledge that it would have a negative impact on
indexes where this type of access isn't required, as you suggested,
though that is more than likely not the case. I just wonder which option
more people would be happier with and whether the added 16-20 bytes
would be extremely noticeable considering most 1-3 year old hardware.

I'm very much against this. After some quick math, my database would
grow by about 40GB if this was done. Storage isn't that cheap when you
include the hot-backup master, various slaves, RAM for caching of this
additional index space, backup storage unit on the SAN, tape backups,
additional spindles required to maintain same performance due to
increased IO because I don't have very many queries which would receive an
advantage (big one for me -- we started buying spindles for performance
a long time ago), etc.

Make it a new index type if you like, but don't impose any new
performance constraints on folks who have little to no advantage from
the above proposal.

Jonah,

People's objections are:
- this shouldn't be the system default, so would need to be implemented
as a non-default option on a b-tree index
- it's a lot of code and if you want it, you gotta do it

Remember you'll need to
- agree all changes via the list and accept that redesigns may be
required, even at a late stage of coding
- write visibility code into the index
- write an additional node type to handle the new capability
- microarchitecture performance testing so you know whether it's really
worthwhile, covering a range of cases
- add code to the optimiser so it can estimate the cost of using this
and know when to do it
- add a column to the catalog to record whether an index has the
visibility option
- add code to the parser to invoke the option
- update pg_dump so that it correctly dumps tables with that option
- copy and adapt all of the existing tests for the new mechanism
- document it

If you really want to do all of that, I'm sure you'd get help, but
mostly it will be you that has to drive the change through.

There are some other benefits of that implementation:
You'd be able to vacuum the index (only), allowing index access to
remain reasonably constant, even as the table itself grew from dead
rows.

The index could then make the reasonably common practice of using a
covered index sensible - i.e. putting additional columns into the index
so that the whole query can be satisfied just from the index.

--
Best Regards, Simon Riggs

#44Jonah H. Harris
jharris@tvi.edu
In reply to: Jonah H. Harris (#24)
Re: Much Ado About COUNT(*)

Bruno Wolff III wrote:

On Wed, Jan 12, 2005 at 14:09:07 -0700,
"Jonah H. Harris" <jharris@tvi.edu> wrote:

Please keep stuff posted to the list so that other people can contribute
and learn from the discussion unless there is a particular reason to
limit who is involved in the discussion.

not a problem.

Perhaps you think that the count is somehow saved in the index so that
you don't have to scan through the whole index to get the number of
rows in a table? That isn't the case, but is what creating a materialized
view would effectively do for you.

I understand that the count is not stored in the index. I am saying
that it may be faster to generate the count off the keys in the index.

I shouldn't have titled this message COUNT(*) as that isn't the only
thing I'm trying to accomplish.

#45Bruno Wolff III
bruno@wolff.to
In reply to: Jonah H. Harris (#24)
Re: Much Ado About COUNT(*)

On Wed, Jan 12, 2005 at 14:09:07 -0700,
"Jonah H. Harris" <jharris@tvi.edu> wrote:

Please keep stuff posted to the list so that other people can contribute
and learn from the discussion unless there is a particular reason to
limit who is involved in the discussion.

Bruno,

Thanks for the information. I was told that PostgreSQL couldn't use
index scans for count(*) because of the visibility issue. Has something
changed or was I told incorrectly?

It isn't that it can't, it is that for cases where you are counting more
than a few percent of a table, it will be faster to use a sequential
scan. Part of the reason is that for any hits you get in the index, you
have to check in the table to make sure the current transaction can see
the current tuple. Even if you could get away with using just an index
scan, you are only going to see a constant-factor speedup, and probably
not with a very big constant.

Perhaps you think that the count is somehow saved in the index so that
you don't have to scan through the whole index to get the number of
rows in a table? That isn't the case, but is what creating a materialized
view would effectively do for you.

#46Jonah H. Harris
jharris@tvi.edu
In reply to: Simon Riggs (#43)
Re: Much Ado About COUNT(*)

Simon Riggs wrote:

Jonah,

People's objections are:
- this shouldn't be the system default, so would need to be implemented
as a non-default option on a b-tree index
- it's a lot of code and if you want it, you gotta do it

Remember you'll need to
- agree all changes via the list and accept that redesigns may be
required, even at a late stage of coding
- write visibility code into the index
- write an additional node type to handle the new capability
- microarchitecture performance testing so you know whether it's really
worthwhile, covering a range of cases
- add code to the optimiser so it can estimate the cost of using this
and to know when to do this
- add a column to the catalog to record whether an index has the
visibility option
- add code to the parser to invoke the option
- update pg_dump so that it correctly dumps tables with that option
- copy and adapt all of the existing tests for the new mechanism
- document it

If you really want to do all of that, I'm sure you'd get help, but
mostly it will be you that has to drive the change through.

There are some other benefits of that implementation:
You'd be able to vacuum the index (only), allowing index access to
remain reasonably constant, even as the table itself grew from dead
rows.

The index could then make sensible the reasonably common practice of
using a covered index - i.e. putting additional columns into the index
to satisfy the whole query just from the index.

Simon,

I am willing to take it on and I understand that the workload is mine.
As long as everyone gives me some suggestions, I'm fine with it being optional.

-Jonah

#47Rod Taylor
pg@rbt.ca
In reply to: Jonah H. Harris (#46)
Re: Much Ado About COUNT(*)

The index could then make sensible the reasonably common practice of
using a covered index - i.e. putting additional columns into the index
to satisfy the whole query just from the index.

I am willing to take it on and I understand that the workload is mine.
As long as everyone gives me some suggestions, I'm fine with it being optional.

If nobody is working on it, you may find that the below TODO item might
accomplish most of what you're looking for as well as generally
improving performance. The count(*) on a where clause would result in
one index scan and one partial sequential heap scan. Not as fast for the
specific examples you've shown, but far better than today and covers
many other cases as well.

Fetch heap pages matching index entries in sequential order

Rather than randomly accessing heap pages based on index
entries, mark heap pages needing access in a bitmap and do the
lookups in sequential order. Another method would be to sort
heap ctids matching the index before accessing the heap rows.
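
For illustration, here is the shape of query such a change would help,
with entirely hypothetical table and index names; the index narrows the
candidate rows, and the TODO item would have the matching heap pages
visited in physical order for the visibility checks rather than randomly:

CREATE TABLE orders (order_id serial PRIMARY KEY, customer_id integer,
                     placed_on date);
CREATE INDEX orders_customer_idx ON orders (customer_id);

-- one index scan to find candidates, then a pass over just the heap
-- pages that contain them (ideally in page order) to check visibility
SELECT count(*) FROM orders WHERE customer_id = 42;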

#48Jeff Davis
jdavis-pgsql@empires.org
In reply to: Alvaro Herrera (#42)
Re: Much Ado About COUNT(*)

That mechanism exists right now, and it's called TOAST, dubbed the best
thing since sliced bread. We even have documentation for it, new as of
our latest RC:

http://developer.postgresql.org/docs/postgres/storage-toast.html

Thanks for the link. It looks like it breaks it up into chunks of about
2KB. I think the conversation was mostly assuming the tables were
somewhat closer to the size of an index. If you have more than 2KB per
tuple, pretty much anything you do with an index would be faster I would
think.

My original concern was that if I had a table like (x int) and postgres
broke the visibility information away from that, it would cause serious
performance problems if postgres had to do a join just to do
"select ... where x = 5". Right?

But of course, we all love toast. Everyone needs to make those wide
tables once in a while, and toast does a great job of taking those
worries away in an efficient way. I am just saying that hopefully we
don't have to seqscan a table with wide tuples very often :)

Regards,
Jeff Davis

#49Greg Stark
gsstark@mit.edu
In reply to: Jeff Davis (#48)
Re: Much Ado About COUNT(*)

Jeff Davis <jdavis-pgsql@empires.org> writes:

But of course, we all love toast. Everyone needs to make those wide
tables once in a while, and toast does a great job of taking those
worries away in an efficient way. I am just saying that hopefully we
don't have to seqscan a table with wide tuples very often :)

I thought toast only handled having individual large columns. So if I have a
2kb text column it'll pull that out of the table for me. But if I have 20
columns each of which have 100 bytes will it still help me? Will it kick in if
I define a single column which stores a record type with 20 columns each of
which have a 100 byte string?

--
greg

#50Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#49)
Re: Much Ado About COUNT(*)

Greg Stark <gsstark@mit.edu> writes:

I thought toast only handled having individual large columns. So if I have a
2kb text column it'll pull that out of the table for me. But if I have 20
columns each of which have 100 bytes will it still help me? Will it kick in if
I define a single column which stores a record type with 20 columns each of
which have a 100 byte string?

Yes, and yes.

regards, tom lane

#51Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Jonah H. Harris (#12)
Re: Much Ado About COUNT(*)

Jonah H. Harris wrote:

1. Is there any answer to Bruce's last statement in the thread, "Re:
[PERFORM] COUNT(*) again (was Re: Index/Function organized..."
(http://archives.postgresql.org/pgsql-hackers/2003-10/msg00245.php)

Let me summarize my ideas from the above URL and why they are probably
wrong.

My basic idea was to keep a status bit on each index entry telling it if
a previous backend looked at the heap and determined it was valid. Here
are the details:

Each heap tuple stores the creation and expire xid. To determine if a
tuple is visible, you have to check the clog to see if the recorded
transaction ids were committed, in progress, or aborted. When you do
the lookup the first time and the transaction isn't in progress, you can
update a bit to say that the tuple is visible or not. In fact we have
several tuple bits:

#define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
#define HEAP_XMIN_INVALID 0x0200 /* t_xmin invalid/aborted */
#define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed */
#define HEAP_XMAX_INVALID 0x0800 /* t_xmax invalid/aborted */

Once set, they allow a later backend to know the visibility of the row
without having to re-check clog.

The big overhead in index lookups is having to check the heap row for
visibility. My idea is that once you check the visibility the first
time in the heap, why can't we set some bit in the index so that later
index lookups don't have to look up the heap anymore. Over time most
index entries would have bits set and you wouldn't need to recheck the
heap. (Oh, and you could only update the bit when all active
transactions are newer than the creation transaction so we know they
should all see it as visible.)

Now, this would work for telling us that the transaction that created
the index entry was committed or aborted. Over time most index entries
would have that status.

I think the problem with this idea is expiration. If a row is deleted
we never go and update the index entries pointing to that heap row, so
the creation status isn't enough for us to know that the row is valid.

I can't think of a way to fix this. I think it is expensive to clear
the status bits on a row delete because finding the index entries that
go with a particular heap row is quite expensive.

So, those are my ideas. If they could be made to work it would give us
most of the advantages of an index scan with _few_ heap lookups with
very little overhead.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#52Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#51)
Re: Much Ado About COUNT(*)

Bruce Momjian <pgman@candle.pha.pa.us> writes:

My basic idea was to keep a status bit on each index entry telling it if
a previous backend looked at the heap and determined it was valid.

Even if you could track the tuple's committed-good status reliably, that
isn't enough under MVCC. The tuple might be committed good, and seen
that way by some other backend that set the bit, and yet it's not supposed
to be visible to your older transaction. Or the reverse at tuple
deletion.

regards, tom lane

#53Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#52)
Re: Much Ado About COUNT(*)

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

My basic idea was to keep a status bit on each index entry telling it if
a previous backend looked at the heap and determined it was valid.

Even if you could track the tuple's committed-good status reliably,
that isn't enough under MVCC. The tuple might be committed good, and
seen that way by some other backend that set the bit, and yet it's not
supposed to be visible to your older transaction. Or the reverse at
tuple deletion.

I mentioned that:

(Oh, and you could only update the bit when all active transactions
are newer than the creation transaction so we know they should all see
it as visible.)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#53)
Re: Much Ado About COUNT(*)

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

Even if you could track the tuple's committed-good status reliably,
that isn't enough under MVCC.

I mentioned that:

(Oh, and you could only update the bit when all active transactions
are newer than the creation transaction so we know they should all see
it as visible.)

Ah, right, I missed the connection. Hmm ... that's sort of the inverse
of the "killed tuple" optimization we put in a release or two back,
where an index tuple is marked as definitely dead once it's committed
dead and the deletion is older than all active transactions. Maybe that
would work. You'd still have to visit the heap when a tuple is in the
"uncertain" states, but with luck that'd be only a small fraction of the
time.

I'm still concerned about the update costs of maintaining these bits,
but this would at least escape the index-bloat objection. I think we
still have one free bit in index tuple headers...

regards, tom lane

#55Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#54)
Re: Much Ado About COUNT(*)

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

Even if you could track the tuple's committed-good status reliably,
that isn't enough under MVCC.

I mentioned that:

(Oh, and you could only update the bit when all active transactions
are newer than the creation transaction so we know they should all see
it as visible.)

Ah, right, I missed the connection. Hmm ... that's sort of the inverse
of the "killed tuple" optimization we put in a release or two back,
where an index tuple is marked as definitely dead once it's committed
dead and the deletion is older than all active transactions. Maybe that
would work. You'd still have to visit the heap when a tuple is in the
"uncertain" states, but with luck that'd be only a small fraction of the
time.

Yes, it is sort of the reverse, but how do you get around the delete
case? Even if the bit is set, how do you know it wasn't deleted since
you set the bit? Seems you always have to still check the heap, no?

I'm still concerned about the update costs of maintaining these bits,
but this would at least escape the index-bloat objection. I think we
still have one free bit in index tuple headers...

You mean you are considering clearing the index bit when you delete the
row?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#56Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#55)
Re: Much Ado About COUNT(*)

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Ah, right, I missed the connection. Hmm ... that's sort of the inverse
of the "killed tuple" optimization we put in a release or two back,
where an index tuple is marked as definitely dead once it's committed
dead and the deletion is older than all active transactions.

Yes, it is sort of the reverse, but how do you get around the delete
case?

A would-be deleter of a tuple would have to go and clear the "known
good" bits on all the tuple's index entries before it could commit.
This would bring the tuple back into the "uncertain status" condition
where backends would have to visit the heap to find out what's up.
Eventually the state would become certain again (either dead to
everyone or live to everyone) and one or the other hint bit could be
set again.

The ugly part of this is that clearing the bit is not like setting a
hint bit, ie it's not okay if we lose that change. Therefore, each
bit-clearing would have to be WAL-logged. This is a big part of my
concern about the cost.

regards, tom lane

#57Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Tom Lane (#22)
Re: Much Ado About COUNT(*)

The fundamental problem is that you can't do it without adding at least
16 bytes, probably 20, to the size of an index tuple header. That would
double the physical size of an index on a simple column (eg an integer
or timestamp). The extra I/O costs and extra maintenance costs are
unattractive to say the least. And it takes away some of the
justification for the whole thing, which is that reading an index is
much cheaper than reading the main table. That's only true if the index
is much smaller than the main table ...

Well, the trick would be to have it specified per-index, then it's up to
the user whether it's faster or not...

#58David Garamond
lists@zara.6.isreserved.com
In reply to: Merlin Moncure (#15)
Re: [HACKERS] Much Ado About COUNT(*)

Merlin Moncure wrote:

6. for large tables, you can get a pretty accurate count by doing:
select count(*) * 10 from t where random() > .9;
on my setup, this shaved about 15% off of the counting time...YMMV.

That's an interesting idea, using sampling to get an estimate.

Thanks for the tip.

--
dave

#59Greg Stark
gsstark@mit.edu
In reply to: David Garamond (#58)
Re: [HACKERS] Much Ado About COUNT(*)

David Garamond <lists@zara.6.isreserved.com> writes:

Merlin Moncure wrote:

6. for large tables, you can get a pretty accurate count by doing:
select count(*) * 10 from t where random() > .9;
on my setup, this shaved about 15% off of the counting time...YMMV.

That's an interesting idea, using sampling to get an estimate.

It's an interesting idea but this particular implementation isn't going to
save any time. It still has to read every record; only now it has to
spend extra time computing random() and doing the arithmetic.

In order for sampling to speed things up you would have to use an index to
actually reduce the number of records read.

The database could be clever and implement the same kind of sampling vacuum
does. That picks a random sampling of pages from the table without using an
index. But there's no way to implement the same kind of behaviour from the
user-visible features.

--
greg

#60Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#56)
Re: Much Ado About COUNT(*)

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Ah, right, I missed the connection. Hmm ... that's sort of the inverse
of the "killed tuple" optimization we put in a release or two back,
where an index tuple is marked as definitely dead once it's committed
dead and the deletion is older than all active transactions.

Yes, it is sort of the reverse, but how do you get around the delete
case?

A would-be deleter of a tuple would have to go and clear the "known
good" bits on all the tuple's index entries before it could commit.
This would bring the tuple back into the "uncertain status" condition
where backends would have to visit the heap to find out what's up.
Eventually the state would become certain again (either dead to
everyone or live to everyone) and one or the other hint bit could be
set again.

Right. Do you think the extra overhead of clearing those bits on
update/delete would be worth it?

The ugly part of this is that clearing the bit is not like setting a
hint bit, ie it's not okay if we lose that change. Therefore, each
bit-clearing would have to be WAL-logged. This is a big part of my
concern about the cost.

Yep, that was my concern too. My feeling is that once you mark the
tuple for expiration (update/delete), you then clear the index bit.
When reading WAL on recovery, you have to clear index bits on rows as
you read expire information from WAL. I don't think it would require
extra WAL information.

The net effect of this idea is that indexes have to look up heap row
status information only for rows that have been recently
created/expired.

If we did have those bits, it would allow us to read data directly from
the index, which is something we don't do now.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#61Csaba Nagy
nagy@ecircle-ag.com
In reply to: Greg Stark (#59)
Re: [HACKERS] Much Ado About COUNT(*)

[snip]

The database could be clever and implement the same kind of sampling vacuum
does. That picks a random sampling of pages from the table without using an
index. But there's no way to implement the same kind of behaviour from the
user-visible features.

... meaning perhaps a new keyword accepted by SELECT, something like
"SAMPLE 1000" ? Which would mean sample records in a 1:1000 ratio ?
Would simplify (and probably speed up a lot) some estimating queries...

Cheers,
Csaba.

#62Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#60)
Re: Much Ado About COUNT(*)

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

The ugly part of this is that clearing the bit is not like setting a
hint bit, ie it's not okay if we lose that change. Therefore, each
bit-clearing would have to be WAL-logged. This is a big part of my
concern about the cost.

Yep, that was my concern too. My feeling is that once you mark the
tuple for expiration (update/delete), you then clear the index bit.
When reading WAL on recovery, you have to clear index bits on rows as
you read expire information from WAL. I don't think it would require
extra WAL information.

Wrong. The WAL recovery environment is not capable of executing
arbitrary user-defined functions, therefore it cannot compute index
entries on its own. The *only* way we can do this is if the WAL record
stream tells exactly what to do and which physical tuple to do it to.

regards, tom lane

#63Greg Stark
gsstark@mit.edu
In reply to: Csaba Nagy (#61)
Re: [HACKERS] Much Ado About COUNT(*)

Csaba Nagy <nagy@ecircle-ag.com> writes:

[snip]

The database could be clever and implement the same kind of sampling vacuum
does. That picks a random sampling of pages from the table without using an
index. But there's no way to implement the same kind of behaviour from the
user-visible features.

... meaning perhaps a new keyword accepted by SELECT, something like
"SAMPLE 1000" ? Which would mean sample records in a 1:1000 ratio ?
Would simplify (and probably speed up a lot) some estimating queries...

See:

http://www.jlcomp.demon.co.uk/faq/random.html

I think the Oracle syntax looks like

SELECT * FROM foo SAMPLE (0.1)

I don't think I would have picked this syntax but it seems like a better idea
to copy the existing practice rather than invent a new one.

There are some details, like what to do when there's a WHERE clause or joins.
Oracle disallows joins entirely and I'm unclear what the best thing to do
about where clauses would be.

--
greg

#64Csaba Nagy
nagy@ecircle-ag.com
In reply to: Greg Stark (#63)
Re: [HACKERS] Much Ado About COUNT(*)

[snip]

See:

http://www.jlcomp.demon.co.uk/faq/random.html

I think the Oracle syntax looks like

SELECT * FROM foo SAMPLE (0.1)

I don't think I would have picked this syntax but it seems like a better idea
to copy the existing practice rather than invent a new one.

There are some details, like what to do when there's a WHERE clause or joins.
Oracle disallows joins entirely and I'm unclear what the best thing to do
about where clauses would be.

The where clauses could be applied exactly as for a normal select, with
the sampling being just a pre-filtering condition for what rows to
consider.

If there were a way to specify the table on which to apply the
sampling, then the whole construct could be replaced automatically by
the inline view the Oracle link recommends.
I doubt there would be any benefit in sampling more than one table in a
query; it should be enough to sample the biggest table and join the
result with the others. Sampling is only useful for really big tables
anyway.

So the syntax above could be extended to:

SELECT * FROM foo SAMPLE (foo, 0.1)

and:

SELECT foo.*, bar.*
FROM foo, bar
WHERE foo.key = bar.key
SAMPLE (foo, 0.1)

which means sample foo and join the result with bar.

All this makes sense from a user's point of view; I wonder how big a
PITA it would be to implement...

Cheers,
Csaba.

#65D'Arcy J.M. Cain
darcy@druid.net
In reply to: Tom Lane (#62)
Re: Much Ado About COUNT(*)

On Thu, 13 Jan 2005 10:29:16 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:

Wrong. The WAL recovery environment is not capable of executing
arbitrary user-defined functions, therefore it cannot compute index
entries on its own. The *only* way we can do this is if the WAL
record stream tells exactly what to do and which physical tuple to do
it to.

I'm not sure why everyone wants to push this into the database anyway.
If I need to know the count of something, I am probably in a better
position to decide what and how than the database can ever do. For
example, I recently had to track balances for certificates in a database
with 25M certificates with multiple transactions on each. In this case
it is a SUM() instead of a count but the idea is the same. We switched
from the deprecated money type to numeric and the calculations started
taking too long for our purposes. We created a new table to track
balances and created rules to keep it updated. All the complexity and
extra work is limited to changes to that one table and does exactly what
we need it to do. It even deals with transactions that get cancelled
but remain in the table.

If you need the count of entire tables, a simple rule on insert and
delete can manage that for you. A slightly more complicated set of
rules can keep counts based on the value of some field, just like we did
for the certificate ID in the transactions. Getting the database to
magically track this based on arbitrary business rules is guaranteed to
be complex and still not handle everyone's requirements.
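
For illustration, a minimal sketch of that approach using triggers rather
than rules; every name here is hypothetical, and note that concurrent
writers will serialize on the single counter row:

CREATE TABLE row_counts (tablename text PRIMARY KEY, n bigint NOT NULL);
INSERT INTO row_counts SELECT 't', count(*) FROM t;  -- seed with the current count

CREATE FUNCTION t_maintain_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE row_counts SET n = n + 1 WHERE tablename = 't';
    ELSE  -- DELETE
        UPDATE row_counts SET n = n - 1 WHERE tablename = 't';
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER t_count AFTER INSERT OR DELETE ON t
    FOR EACH ROW EXECUTE PROCEDURE t_maintain_count();

-- "SELECT count(*) FROM t" then becomes:
SELECT n FROM row_counts WHERE tablename = 't';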

-- 
D'Arcy J.M. Cain <darcy@druid.net>         |  Democracy is three wolves
http://www.druid.net/darcy/                |  and a sheep voting on
+1 416 425 1212     (DoD#0082)    (eNTP)   |  what's for dinner.
#66Jonah H. Harris
jharris@tvi.edu
In reply to: D'Arcy J.M. Cain (#65)
Re: Much Ado About COUNT(*)

D'Arcy J.M. Cain wrote:

I'm not sure why everyone wants to push this into the database anyway.
If I need to know the count of something, I am probably in a better
position to decide what and how than the database can ever do. For
example, I recently had to track balances for certificates in a database
with 25M certificates with multiple transactions on each. In this case
it is a SUM() instead of a count but the idea is the same. We switched
from the deprecated money type to numeric and the calculations started
taking too long for our purposes. We created a new table to track
balances and created rules to keep it updated. All the complexity and
extra work is limited to changes to that one table and does exactly what
we need it to do. It even deals with transactions that get cancelled
but remain in the table.

If you need the count of entire tables, a simple rule on insert and
delete can manage that for you. A slightly more complicated set of
rules can keep counts based on the value of some field, just like we did
for the certificate ID in the transactions. Getting the database to
magically track this based on arbitrary business rules is guaranteed to
be complex and still not handle everyone's requirements.

This discussion is not solely related to COUNT, but to advanced usage of
indexes in general.

Did everyone get to read the info on Oracle's fast full index scan? It
performs sequential I/O on the indexes, pulling all of the index blocks
into memory to reduce random I/O and speed up the index scan.

#67Wes
wespvp@syntegra.com
In reply to: Greg Stark (#63)
Re: [HACKERS] Much Ado About COUNT(*)

On 1/13/05 9:50 AM, "Greg Stark" <gsstark@mit.edu> wrote:

I think the Oracle syntax looks like

SELECT * FROM foo SAMPLE (0.1)

I don't think I would have picked this syntax but it seems like a better idea
to copy the existing practice rather than invent a new one.

Of course, in Oracle 'count(*)' is instantaneous. It doesn't have to count
the physical records one by one.

Wes

#68Greg Stark
gsstark@mit.edu
In reply to: Wes (#67)
Re: [HACKERS] Much Ado About COUNT(*)

Wes <wespvp@syntegra.com> writes:

On 1/13/05 9:50 AM, "Greg Stark" <gsstark@mit.edu> wrote:

Of course, in Oracle 'count(*)' is instantaneous. It doesn't have to count
the physical records one by one.

That's simply false. Oracle does indeed have to count the records one by one.

It doesn't have to read and ignore the dead records since they're in a
separate place (but on the other hand it sometimes has to go read that
separate place when it sees records that were committed after your
transaction).

It can also do index-only scans, which often helps, but it's still not
instantaneous.

--
greg

#69Wes
wespvp@syntegra.com
In reply to: Greg Stark (#68)
Re: [HACKERS] Much Ado About COUNT(*)

On 1/13/05 6:44 PM, "Greg Stark" <gsstark@mit.edu> wrote:

That's simply false. Oracle does indeed have to count the records one by one.

It doesn't have to read and ignore the dead records since they're in a
separate place (but on the other hand it sometimes has to go read that
separate place when it sees records that were committed after your
transaction).

It can also do index-only scans, which often helps, but it's still not
instantaneous.

Ok, I stand corrected - I was given some wrong information. However, my
experience has been that count(*) on Oracle is a whole lot faster than
PostgreSQL - what appeared instantaneous on Oracle took some time on
PostgreSQL. That was one of the first things I noticed when moving a
database application to PostgreSQL. I've since disposed of the Oracle
database, so can't go back and retest.

Wes

#70Greg Stark
gsstark@mit.edu
In reply to: Wes (#69)
Re: [HACKERS] Much Ado About COUNT(*)

Wes <wespvp@syntegra.com> writes:

On 1/13/05 6:44 PM, "Greg Stark" <gsstark@mit.edu> wrote:

That's simply false. Oracle does indeed have to count the records one by one.

Ok, I stand corrected - I was given some wrong information. However, my
experience has been that count(*) on Oracle is a whole lot faster than
PostgreSQL - what appeared instantaneous on Oracle took some time on
PostgreSQL. That was one of the first things I noticed when moving a
database application to PostgreSQL. I've since disposed of the Oracle
database, so can't go back and retest.

If it was instantaneous then the data must have all been in cache. A lot of
Oracle kudos really come down to the fact that Oracle is often run on beefier
machines than others.

But if it was merely 2x as fast or so, more if the table was really wide, then
it could have just been because of the fast index-only scan.

If it was more than 2-4x as fast for a narrow table and you don't think the
whole thing was in cache then I would start to wonder about whether your
postgres table suffered from bloat from not having vacuum run frequently
enough or having the fsm settings too low.

--
greg

#71Frank D. Engel, Jr.
fde101@fjrhome.net
In reply to: Wes (#69)
Re: [HACKERS] Much Ado About COUNT(*)

This is probably stupid for some reason, but why not use a 64-bit
integer to track the number of records in the table? Increment when
adding records, decrement when deleting them... then COUNT(*) could
just return that in cases where a query is known to be looking at all
of the records?

On Jan 14, 2005, at 12:04 PM, Wes wrote:

On 1/13/05 6:44 PM, "Greg Stark" <gsstark@mit.edu> wrote:

That's simply false. Oracle does indeed have to count the records one
by one.

It doesn't have to read and ignore the dead records since they're in a
separate place (but on the other hand it sometimes has to go read
that
separate place when it sees records that were committed after your
transaction).

It can also do index-only scans, which often helps, but it's still not
instantaneous.

Ok, I stand corrected - I was given some wrong information. However,
my
experience has been that count(*) on Oracle is a whole lot faster than
PostgreSQL - what appeared instantaneous on Oracle took some time on
PostgreSQL. That was one of the first things I noticed when moving a
database application to PostgreSQL. I've since disposed of the Oracle
database, so can't go back and retest.

Wes


-----------------------------------------------------------
Frank D. Engel, Jr. <fde101@fjrhome.net>

$ ln -s /usr/share/kjvbible /usr/manual
$ true | cat /usr/manual | grep "John 3:16"
John 3:16 For God so loved the world, that he gave his only begotten
Son, that whosoever believeth in him should not perish, but have
everlasting life.
$

#72Richard Huxton
dev@archonet.com
In reply to: Frank D. Engel, Jr. (#71)
Re: [HACKERS] Much Ado About COUNT(*)

Frank D. Engel, Jr. wrote:

This is probably stupid for some reason, but why not use a 64-bit
integer to track the number of records in the table? Increment when
adding records, decrement when deleting them... then COUNT(*) could just
return that in cases where a query is known to be looking at all of the
records?

Check the list archives for details, but you need to consider multiple
backends inserting/deleting concurrently. What you need is a separate
little table where you can log your transaction-id and the number of
rows added/removed; then you can figure out how many rows there are
from different viewpoints.
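
A rough sketch of that idea, leaving out the transaction-id column for
brevity (all names are hypothetical): each writer only appends delta
rows, so writers never block each other, and MVCC means every backend's
sum reflects exactly the rows visible to its own snapshot.

CREATE TABLE t_count_log (delta bigint NOT NULL);
INSERT INTO t_count_log SELECT count(*) FROM t;  -- seed once with the current count

CREATE FUNCTION t_log_delta() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO t_count_log VALUES (+1);
    ELSE  -- DELETE
        INSERT INTO t_count_log VALUES (-1);
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER t_log_delta_trig AFTER INSERT OR DELETE ON t
    FOR EACH ROW EXECUTE PROCEDURE t_log_delta();

-- the count as seen by the current transaction:
SELECT sum(delta) FROM t_count_log;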

--
Richard Huxton
Archonet Ltd

#73Martijn van Oosterhout
kleptog@svana.org
In reply to: Frank D. Engel, Jr. (#71)
Re: [HACKERS] Much Ado About COUNT(*)

On Fri, Jan 14, 2005 at 12:39:04PM -0500, Frank D. Engel, Jr. wrote:

This is probably stupid for some reason, but why not use a 64-bit
integer to track the number of records in the table? Increment when
adding records, decrement when deleting them... then COUNT(*) could
just return that in cases where a query is known to be looking at all
of the records?

Because there is no single value for count(*): if you're in a
transaction that has added records, it will be bigger than in a
transaction that hasn't. How does your integer deal with this?

The usual solutions to this involve locking, which is precisely what MVCC
is designed to avoid.

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

#74Frank D. Engel, Jr.
fde101@fjrhome.net
In reply to: Martijn van Oosterhout (#73)
Re: [HACKERS] Much Ado About COUNT(*)

Yep, that could cause problems. Okay, now I'm joining the program.

The only thing I can see that would fix this for the integer would be
to keep track of the number of 'committed' records using the integer,
then for each new transaction, make a copy of the integer in order to
remember its "place", using an additional integer to track the offset
(how many rows have been added or removed), so that it can be correctly
applied at commit time. It's probably too messy to be worthwhile this
way, though. More trouble than it would be worth.

On Jan 14, 2005, at 1:38 PM, Martijn van Oosterhout wrote:

On Fri, Jan 14, 2005 at 12:39:04PM -0500, Frank D. Engel, Jr. wrote:

This is probably stupid for some reason, but why not use a 64-bit
integer to track the number of records in the table? Increment when
adding records, decrement when deleting them... then COUNT(*) could
just return that in cases where a query is known to be looking at all
of the records?

Because there is no single value for count(*), if you're in a
transaction that has added records it will be bigger than in a
transaction that hasn't. How does your integer deal with this?

The usual solutions to this involve locking, which is precisely what MVCC
is designed to avoid.

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org>
http://svana.org/kleptog/

Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is
a
tool for doing 5% of the work and then sitting around waiting for
someone
else to do the other 95% so you can sue them.

-----------------------------------------------------------
Frank D. Engel, Jr. <fde101@fjrhome.net>

$ ln -s /usr/share/kjvbible /usr/manual
$ true | cat /usr/manual | grep "John 3:16"
John 3:16 For God so loved the world, that he gave his only begotten
Son, that whosoever believeth in him should not perish, but have
everlasting life.
$

#75PFC
lists@boutiquenumerique.com
In reply to: Frank D. Engel, Jr. (#74)
MOVE

Hello,

Here I'm implementing session management, which has a connections table
partitioned between active and archived connections. A connection
represents a connection between a user and a chatroom.

I use partitioning for performance reasons.

The active table contains all the data for the active session: user_id,
chatroom_id, session start time, and other information.
The archive table contains just the user_id, chatroom_id, and session
start and end times, for logging purposes, and for displaying on the site
which user was logged in to which chatroom and from when to when.

Thus, when a user disconnects from a chatroom, I must move one row from
the active to the archive table. This poses no problem as there is a
UNIQUE index (user_id, chatroom_id), so I SELECT the row FOR UPDATE,
insert it into the archive table, then delete it.

Now, when a user logs out from the site, or when his session is purged
by the auto-expiration cron job, I must also expire ALL his open
chatroom connections:
INSERT INTO archive (...) SELECT ... FROM active WHERE user_id = ...;
DELETE FROM active WHERE user_id = ...;

Now, if the user inserts a connection between the two queries above, the
thing will fail (the connection will just be deleted). I know that there
are many ways to do it right:
- LOCK the table in exclusive mode
- use an additional primary key on the active table which is not related
  to the user_id and the chatroom_id, select the id's of the sessions to
  expire into a temporary table, and use that
- use an extra field in the table to mark that the rows are being processed
- use transaction isolation level SERIALIZABLE

However, all these methods somehow don't feel right, and as this is an
often-encountered problem, I'd really like to have an SQL command, say
MOVE, or SELECT AND DELETE, whatever, which acts like a SELECT, returning
the rows, but deleting them as well. Then I'd just do:
INSERT INTO archive (...) SELECT ... AND DELETE FROM active WHERE user_id = ...;

which would have the following advantages:
- No worries about locks:
  - less chance of bugs
  - higher performance, because locks have to be waited on, by definition
- No need to do the request twice (so, it is twice as fast!)
- Simplicity and elegance

There would be a hidden bonus: if you acquire locks, you'd better COMMIT
the transaction as soon as possible to release them, whereas here, you
can happily continue in the transaction.

I think this command would make a nice cousin to the also very popular
INSERT ... OR UPDATE, which tries to insert a row and, if it exists,
UPDATEs it instead of inserting it!

What do you think?

#76Martijn van Oosterhout
kleptog@svana.org
In reply to: PFC (#75)
Re: MOVE

On Fri, Jan 14, 2005 at 08:49:24PM +0100, PFC wrote:

the auto-expiration cron
job, I must also expire ALL his open chatroom connections.
INSERT INTO archive (...) SELECT ... FROM active WHERE user_id = ...;
DELETE FROM active WHERE user_id = ...;

Now, if the user inserts a connection between the two queries above, the
thing will fail (the
connection will just be deleted). I know that there are many ways to do it
right :

Why not just do it in a single transaction? I don't think you need to
use SERIALIZABLE at all, I think normal read-committed mode will do
what you want, no?

BEGIN;
INSERT INTO archive (...) SELECT ... FROM active WHERE user_id = ...;
DELETE FROM active WHERE user_id = ...;
COMMIT;

The DELETE can only delete the rows returned by the select, that's the
whole point of transactions...

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

#77Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#76)
Re: MOVE

Martijn van Oosterhout <kleptog@svana.org> writes:

Why not just do it in a single transaction? I don't think you need to
use SERIALIZABLE at all, I think normal read-committed mode will do
what you want, no?

BEGIN;
INSERT INTO archive (...) SELECT ... FROM active WHERE user_id = ...;
DELETE FROM active WHERE user_id = ...;
COMMIT;

No, that's exactly wrong: in read-committed mode the DELETE could delete
rows that were not seen by the SELECT. It would work in serializable
mode though.
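
A minimal sketch of that serializable variant, with hypothetical column
names based on the active/archive tables described earlier in the thread;
on a serialization failure the whole transaction has to be retried:

BEGIN;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
INSERT INTO archive (user_id, chatroom_id, started_at, ended_at)
    SELECT user_id, chatroom_id, started_at, now()
    FROM active
    WHERE user_id = 42;            -- hypothetical user id
DELETE FROM active WHERE user_id = 42;
COMMIT;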

regards, tom lane

#78Wes
wespvp@syntegra.com
In reply to: Frank D. Engel, Jr. (#74)
Re: [HACKERS] Much Ado About COUNT(*)

On 1/14/05 12:47 PM, "Frank D. Engel, Jr." <fde101@fjrhome.net> wrote:

It's probably too messy to be worthwhile this
way, though. More trouble than it would be worth.

It would be rather useful if there were a way to get a reasonably
accurate count (better than analyze provides) in a very short period.
When you've got a relatively wide table that has hundreds of millions
to over a billion rows, and you need to report on how many rows are in
the table, that can take a long time.

Wes

#79PFC
lists@boutiquenumerique.com
In reply to: Martijn van Oosterhout (#76)
Re: MOVE

BEGIN;
INSERT INTO archive (...) SELECT ... FROM active WHERE user_id = ...;
DELETE FROM active WHERE user_id = ...;
COMMIT;

The DELETE can only delete the rows returned by the select, that's the
whole point of transactions...

Well, in the case of having a unique index on user_id, and if no-one
updates the row between the insert and the delete, it will work ;)
And if someone updates it in between, well, the update is not archived,
so you have to LOCK your row FOR UPDATE.
And if this procedure is called twice at the same time, the rows will be
historized twice...
etc...

Which is precisely why I don't like this approach!

As a side note, I've installed 8.0 rc, and wow. The slow queries feel a
lot faster on the command prompt, and my small queries have become faster
too... very good work!

In the end, I've implemented it with an AFTER DELETE trigger on the
'live' table, which, after the row has been deleted, inserts it in the
history table, using the magic variable OLD. This will work because the
row is already deleted, thus can't be concurrently updated by another
transaction (because a transaction trying to update a row will wait on the
lock acquired by the DELETE, and vice versa).
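
A minimal sketch of that trigger, with hypothetical column names matching
the tables described earlier:

CREATE FUNCTION archive_deleted_session() RETURNS trigger AS $$
BEGIN
    -- OLD holds the row that has just been deleted from the live table
    INSERT INTO archive (user_id, chatroom_id, started_at, ended_at)
        VALUES (OLD.user_id, OLD.chatroom_id, OLD.started_at, now());
    RETURN OLD;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER archive_on_delete AFTER DELETE ON active
    FOR EACH ROW EXECUTE PROCEDURE archive_deleted_session();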

So, for ONE row at a time, in a trigger, it works beautifully, thank you
postgres! I find the solution very elegant: when a session expires, it
is deleted from the live session table, then the trigger catches it just
in time to shove it in the sessions history table, then some other tables
like user-to-chatroom connections, which happen to have a user_id
referencing into the live sessions table, get the ON DELETE CASCADE and
are also purged and historized automatically. I am very happy with this
solution BUT it's done one-row-at-a-time, so it's slower than I'd like!

The key is to insert the row deleted from the live table into the
history table AFTER it has been deleted, to avoid all funky lock
problems. Now consider the following: you must DELETE several items at
the same time and historize them.

If you INSERT then DELETE:
- records can be inserted, updated or deleted between the two. The
inserted ones will be historized but not deleted (duplicates!), the
deleted ones will be lost forever, unhistorized, and the updated ones
won't have their updates historized. Not very good for concurrency!

You can LOCK FOR UPDATE beforehand; this solves the UPDATE and DELETE
problem, but not the INSERT problem.
You can, of course, lock the entire table, but well, it reminds me too
much of the MySQL way.
You can also use SERIALIZABLE mode, which solves all the problems, but if
something goes wrong, everything fails and you have to retry the whole
transaction, whereas a proper lock would just be waited on...
If there is a primary key in the 'live' table you can SELECT FOR UPDATE
into a temporary table, then delete using the pkeys in the temp table,
then insert from the temp table... ugly!

That's why I'm bothering you about the possibility of DELETE returning
the DELETE'd rows ;)

It's not very useful if you process one row, but when you process several
at a time, it would be really great, because instead of 2*N queries
(DELETE+INSERT hidden in a trigger) you'd just do one (INSERT ... DELETE
AND SELECT ... FROM ...). Today, if you don't want to do it in a trigger,
you have to have a unique index, SELECT FOR UPDATE, INSERT, DELETE, that's
three queries per row.
In a perfect world, this would then be used in an ON DELETE RULE which
would replace the DELETEs with deletes that insert the rows into the
history table.

Also, I've thought about some other interesting applications: if DELETE
returns rows, why not UPDATE or even INSERT?
Many applications use INSERT... then SELECT currval(sequence). I also
like to set defaults in the database, like for instance some rows which
have timestamp fields defaulting to now() or things like that. I have a
tree table with a ltree field which is generated by a trigger from the
parent's path and the current row's id. Some other fields are also
inherited from the parent. Why not do INSERT INTO ... AND SELECT ... which
would return the sequence field, and any other fields which have been
initialized by ON INSERT triggers... this would be neat... instead of
INSERT, SELECT currval, SELECT .. FROM table WHERE id=...
Same thing for on update triggers.
You could replace some plpgsql procedures with one query, and what's more
important, not worry about locking headaches.

Anyway, my problem is solved now with triggers, but I like the idea very
much (and Oracle has it) (and Tom once said a DELETE was just more or less
like a SELECT)... so ...

Regards

#80Greg Stark
gsstark@mit.edu
In reply to: Frank D. Engel, Jr. (#74)
Re: [HACKERS] Much Ado About COUNT(*)

"Frank D. Engel, Jr." <fde101@fjrhome.net> writes:

Yep, that could cause problems. Okay, now I'm joining the program.

The only thing I can see that would fix this
...

There are well understood mechanisms to fix this. It's a "SMOP" or "simple
matter of programming". What you would do is insert into a summary table a
record that indicates how many records you've inserted into the master table.
Periodically you have some daemon collect up those records and replace them
with a single record.
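
Continuing the hypothetical delta-log sketch from earlier in the thread,
the periodic collapse such a daemon performs might look roughly like this;
the brief EXCLUSIVE lock keeps new deltas out while the rows are folded
into one, but plain readers are not blocked:

BEGIN;
LOCK TABLE t_count_log IN EXCLUSIVE MODE;
CREATE TEMP TABLE t_count_collapsed AS
    SELECT coalesce(sum(delta), 0) AS delta FROM t_count_log;
DELETE FROM t_count_log;
INSERT INTO t_count_log SELECT delta FROM t_count_collapsed;
DROP TABLE t_count_collapsed;
COMMIT;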

But this can be done already by hand and it's not clear having the database do
it automatically is necessarily a good idea. It would impose a cost on every
insert when most of the time it wouldn't be useful.

Moreover this is just a special case of a general problem called "materialized
views". If it were added to the database it would probably be more worthwhile
implementing a more general feature that could handle other aggregate
functions besides count(*) as well as other types of queries besides simple
unqualified aggregates.

--
greg

#81Manfred Koizar
mkoi-pg@aon.at
In reply to: Tom Lane (#56)
Re: Much Ado About COUNT(*)

On Thu, 13 Jan 2005 00:39:56 -0500, Tom Lane <tgl@sss.pgh.pa.us>
wrote:

A would-be deleter of a tuple would have to go and clear the "known
good" bits on all the tuple's index entries before it could commit.
This would bring the tuple back into the "uncertain status" condition
where backends would have to visit the heap to find out what's up.
Eventually the state would become certain again (either dead to
everyone or live to everyone) and one or the other hint bit could be
set again.

Last time we discussed this, didn't we come to the conclusion, that
resetting status bits is not a good idea because of possible race
conditions?

In a previous post you wrote:
| I think we still have one free bit in index tuple headers...

AFAICS we'd need two new bits: "visible to all" and "maybe dead".

Writing the three status bits in the order "visible to all", "maybe
dead", "known dead", a normal index tuple lifecycle would be

000 -> 100 -> 110 -> 111

In states 000 and 110 the heap tuple has to be read to determine
visibility.

The transitions 000 -> 100 and 110 -> 111 happen as side effects of
index scans. 100 -> 110 has to be done by the deleting transaction.
This is the operation where the additional run time cost lies.

One weakness of this approach is that once the index tuple state is
110 but the deleting transaction is aborted there is no easy way to
reset the "maybe deleted" bit. So we'd have to consult the heap for
the rest of the tuple's lifetime.

Servus
Manfred

#82Tom Lane
tgl@sss.pgh.pa.us
In reply to: Manfred Koizar (#81)
Re: Much Ado About COUNT(*)

Manfred Koizar <mkoi-pg@aon.at> writes:

On Thu, 13 Jan 2005 00:39:56 -0500, Tom Lane <tgl@sss.pgh.pa.us>
wrote:

A would-be deleter of a tuple would have to go and clear the "known
good" bits on all the tuple's index entries before it could commit.
This would bring the tuple back into the "uncertain status" condition
where backends would have to visit the heap to find out what's up.
Eventually the state would become certain again (either dead to
everyone or live to everyone) and one or the other hint bit could be
set again.

Last time we discussed this, didn't we come to the conclusion, that
resetting status bits is not a good idea because of possible race
conditions?

There's no race condition, since resetting the hint only forces other
transactions to go back to the original data (the tuple header) to
decide what to do. AFAICS the above is safe; I'm just pretty dubious
about the cost.

AFAICS we'd need two new bits: "visible to all" and "maybe dead".

No, you've got this wrong. The three possible states are "known visible
to all", "known dead to all", and "uncertain". If you see "uncertain"
this means you have to go to the heap and compare the XIDs in the tuple
header to your snapshot to decide if you can see the row or not. The
index states are not the same as the "known committed good" or
"known committed dead" hint bits in the tuple header --- those can be
set as soon as the inserting/deleting transaction's outcome is known,
but we can't move the index entry into the "visible to all" or "dead to
all" states until that outcome is beyond the GlobalXmin event horizon.

regards, tom lane

#83Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#82)
Re: Much Ado About COUNT(*)

Tom Lane <tgl@sss.pgh.pa.us> writes:

Manfred Koizar <mkoi-pg@aon.at> writes:

Last time we discussed this, didn't we come to the conclusion, that
resetting status bits is not a good idea because of possible race
conditions?

There's no race condition,

Actually, wait a minute --- you have a point. Consider a tuple whose
inserting transaction (A) has just dropped below GlobalXmin.
Transaction B is doing an index scan, so it's going to do something like

* Visit index entry, observe that it is in "uncertain" state.
* Visit heap tuple, observe that A has committed and is < GlobalXmin,
and there is no deleter.
* Return to index entry and mark it "visible to all".

Now suppose transaction C decides to delete the tuple. It will

* Insert itself as the XMAX of the heap tuple.
* Visit index entry, set state to "uncertain" if not already that way.

C could do this between steps 2 and 3 of B, in which case the index
entry ends up improperly marked "visible to all" while in fact a
deletion is pending. Ugh. We'd need some kind of interlock to prevent
this from happening, and it's not clear what. Might be tricky to create
such an interlock without introducing either deadlock or a big
performance penalty.

regards, tom lane

#84Jochem van Dieten
jochemd@gmail.com
In reply to: Manfred Koizar (#81)
Re: Much Ado About COUNT(*)

On Sun, 16 Jan 2005 20:49:45 +0100, Manfred Koizar wrote:

On Thu, 13 Jan 2005 00:39:56 -0500, Tom Lane wrote:

A would-be deleter of a tuple would have to go and clear the "known
good" bits on all the tuple's index entries before it could commit.
This would bring the tuple back into the "uncertain status" condition
where backends would have to visit the heap to find out what's up.
Eventually the state would become certain again (either dead to
everyone or live to everyone) and one or the other hint bit could be
set again.

Last time we discussed this, didn't we come to the conclusion, that
resetting status bits is not a good idea because of possible race
conditions?

http://archives.postgresql.org/pgsql-performance/2004-05/msg00004.php

In a previous post you wrote:
| I think we still have one free bit in index tuple headers...

AFAICS we'd need two new bits: "visible to all" and "maybe dead".

Writing the three status bits in the order "visible to all", "maybe
dead", "known dead", a normal index tuple lifecycle would be

000 -> 100 -> 110 -> 111

In states 000 and 110 the heap tuple has to be read to determine
visibility.

The transitions 000 -> 100 and 110 -> 111 happen as side effects of
index scans. 100 -> 110 has to be done by the deleting transaction.
This is the operation where the additional run time cost lies.

One weakness of this approach is that once the index tuple state is
110 but the deleting transaction is aborted there is no easy way to
reset the "maybe deleted" bit. So we'd have to consult the heap for
the rest of the tuple's lifetime.

How bad is that really with a typical workload?

Jochem

#85Jim C. Nasby
decibel@decibel.org
In reply to: Tom Lane (#83)
Re: Much Ado About COUNT(*)

On Sun, Jan 16, 2005 at 03:22:11PM -0500, Tom Lane wrote:

Tom Lane <tgl@sss.pgh.pa.us> writes:

Manfred Koizar <mkoi-pg@aon.at> writes:

Last time we discussed this, didn't we come to the conclusion, that
resetting status bits is not a good idea because of possible race
conditions?

There's no race condition,

Actually, wait a minute --- you have a point. Consider a tuple whose
inserting transaction (A) has just dropped below GlobalXmin.
Transaction B is doing an index scan, so it's going to do something like

* Visit index entry, observe that it is in "uncertain" state.
* Visit heap tuple, observe that A has committed and is < GlobalXmin,
and there is no deleter.
* Return to index entry and mark it "visible to all".

Now suppose transaction C decides to delete the tuple. It will

* Insert itself as the XMAX of the heap tuple.
* Visit index entry, set state to "uncertain" if not already that way.

C could do this between steps 2 and 3 of B, in which case the index
entry ends up improperly marked "visible to all" while in fact a
deletion is pending. Ugh. We'd need some kind of interlock to prevent
this from happening, and it's not clear what. Might be tricky to create
such an interlock without introducing either deadlock or a big
performance penalty.

Wouldn't the original proposal that had a state machine handle this?
IIRC the original idea was:

new tuple -> known good -> possibly dead -> known dead

In this case, you would have to visit the heap page when an entry is in
the 'new tuple' or 'possibly dead' states. When it comes to transitions,
you would enforce the transitions as shown, which would eliminate the
race condition you thought of.

Err, I guess maybe you have to allow going from possibly dead back to
known good? But I think that would be the only 'backwards' transition.
In the event of a rollback on an insert I think you'd want to go
directly from new tuple to known dead, as well.
--
Jim C. Nasby, Database Consultant decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

#86Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim C. Nasby (#85)
Re: Much Ado About COUNT(*)

"Jim C. Nasby" <decibel@decibel.org> writes:

Wouldn't the original proposal that had a state machine handle this?
IIRC the original idea was:

new tuple -> known good -> possibly dead -> known dead

Only if you disallow the transition from possibly dead back to known
good, which strikes me as a rather large disadvantage. Failed UPDATEs
aren't so uncommon that it's okay to have one permanently disable the
optimization.

regards, tom lane

#87Jochem van Dieten
jochemd@gmail.com
In reply to: Tom Lane (#86)
Re: Much Ado About COUNT(*)

On Sun, 16 Jan 2005 20:01:36 -0500, Tom Lane wrote:

"Jim C. Nasby" writes:

Wouldn't the original proposal that had a state machine handle this?
IIRC the original idea was:

new tuple -> known good -> possibly dead -> known dead

Only if you disallow the transition from possibly dead back to known
good, which strikes me as a rather large disadvantage. Failed UPDATEs
aren't so uncommon that it's okay to have one permanently disable the
optimization.

But how about allowing the transition from "possibly dead" to "new
tuple"? What if a failed update restores the tuple to the "new tuple"
state, and only after that it can be promoted to "known good" state?

Jochem

#88Jim C. Nasby
decibel@decibel.org
In reply to: Tom Lane (#86)
Re: Much Ado About COUNT(*)

On Sun, Jan 16, 2005 at 08:01:36PM -0500, Tom Lane wrote:

"Jim C. Nasby" <decibel@decibel.org> writes:

Wouldn't the original proposal that had a state machine handle this?
IIRC the original idea was:

new tuple -> known good -> possibly dead -> known dead

Only if you disallow the transition from possibly dead back to known
good, which strikes me as a rather large disadvantage. Failed UPDATEs
aren't so uncommon that it's okay to have one permanently disable the
optimization.

Actually, I guess I wasn't understanding the problem to begin with.
You'd never go from new tuple to known good while the transaction that
created the tuple was in-flight, right? If that's the case, I'm not sure
where there's a race condition. You can't delete a tuple that hasn't
been committed, right?
--
Jim C. Nasby, Database Consultant decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

#89Jim C. Nasby
decibel@decibel.org
In reply to: Jim C. Nasby (#88)
Re: Much Ado About COUNT(*)

On Sun, Jan 16, 2005 at 07:28:07PM -0600, Jim C. Nasby wrote:

On Sun, Jan 16, 2005 at 08:01:36PM -0500, Tom Lane wrote:

"Jim C. Nasby" <decibel@decibel.org> writes:

Wouldn't the original proposal that had a state machine handle this?
IIRC the original idea was:

new tuple -> known good -> possibly dead -> known dead

Only if you disallow the transition from possibly dead back to known
good, which strikes me as a rather large disadvantage. Failed UPDATEs
aren't so uncommon that it's okay to have one permanently disable the
optimization.

Actually, I guess I wasn't understanding the problem to begin with.
You'd never go from new tuple to known good while the transaction that
created the tuple was in-flight, right? If that's the case, I'm not sure
where there's a race condition. You can't delete a tuple that hasn't
been committed, right?

Er, nevermind, I thought about it and realized the issue.

What changes when a delete is done on a tuple? It seems that's the
key... if a tuple has been marked as possibly being expired/deleted,
don't allow it to go into known_good unless you can verify that the
transaction that marked it as potentially deleted was rolled back.
--
Jim C. Nasby, Database Consultant decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

#90Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim C. Nasby (#88)
Re: Much Ado About COUNT(*)

"Jim C. Nasby" <decibel@decibel.org> writes:

Actually, I guess I wasn't understanding the problem to begin with.
You'd never go from new tuple to known good while the transaction that
created the tuple was in-flight, right?

By definition, not.

If that's the case, I'm not sure
where there's a race condition. You can't delete a tuple that hasn't
been committed, right?

The originating transaction could itself delete the tuple, but no one
else could see it yet to do that. This means that you'd have to allow
a transition directly from new tuple to possibly dead. (In the absence
of subtransactions this could be optimized into a transition directly
to known dead, but now that we have subtransactions I don't think we
can do that.)

However, the race condition comes in when someone wants to delete the
row at about the same time as someone else is trying to mark it known
good, ie, sometime *after* the originating transaction committed.
This is definitely a possible situation.

regards, tom lane

#91Sailesh Krishnamurthy
sailesh@cs.berkeley.edu
In reply to: Jonah H. Harris (#32)
Re: Much Ado About COUNT(*)

"Jonah" == Jonah H Harris <jharris@tvi.edu> writes:

Jonah> Replying to the list as a whole:

Jonah> If this is such a bad idea, why do other database systems
Jonah> use it? As a businessperson myself, it doesn't seem
Jonah> logical to me that commercial database companies would
Jonah> spend money on implementing this feature if it wouldn't be
Jonah> used. Remember guys, I'm just trying to help.

Systems like DB2 don't implement versioning schemes. As a result there
is no need to worry about maintaining visibility in
indexes. Index-only plans are thus viable as they require no change in
the physical structure of the index and no overhead on
update/delete/insert ops.

I don't know about Oracle, which I gather is the only commercial
system to have something like MVCC.

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh

#92Jeff Davis
jdavis-pgsql@empires.org
In reply to: Sailesh Krishnamurthy (#91)
Re: Much Ado About COUNT(*)

On Tue, 2005-01-18 at 12:45 -0800, Sailesh Krishnamurthy wrote:

"Jonah" == Jonah H Harris <jharris@tvi.edu> writes:

Jonah> Replying to the list as a whole:

Jonah> If this is such a bad idea, why do other database systems
Jonah> use it? As a businessperson myself, it doesn't seem
Jonah> logical to me that commercial database companies would
Jonah> spend money on implementing this feature if it wouldn't be
Jonah> used. Remember guys, I'm just trying to help.

Systems like DB2 don't implement versioning schemes. As a result there
is no need to worry about maintaining visibility in
indexes. Index-only plans are thus viable as they require no change in
the physical structure of the index and no overhead on
update/delete/insert ops.

I don't know about Oracle, which I gather is the only commercial
system to have something like MVCC.

Perhaps Firebird/InterBase also? Someone mentioned that on these lists;
I'm not sure if it's true or not.

I almost think to not supply an MVCC system would break the "I" in ACID,
would it not? I can't think of any other obvious way to isolate the
transactions, but on the other hand, wouldn't DB2 want to be ACID
compliant?

Regards,
Jeff Davis

#93Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#92)
Re: Much Ado About COUNT(*)

Jeff Davis <jdavis-pgsql@empires.org> writes:

I almost think to not supply an MVCC system would break the "I" in ACID,
would it not?

Certainly not; ACID was a recognized goal long before anyone thought of
MVCC. You do need much more locking to make it work without MVCC,
though --- for instance, a reader that is interested in a just-modified
row has to block until the writer completes or rolls back.

People who hang around Postgres too long tend to think that MVCC is the
obviously correct way to do things, but much of the rest of the world
thinks differently ;-)

regards, tom lane

#94Sailesh Krishnamurthy
sailesh@cs.berkeley.edu
In reply to: Tom Lane (#93)
Re: Much Ado About COUNT(*)

"Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

Tom> People who hang around Postgres too long tend to think that
Tom> MVCC is the obviously correct way to do things, but much of
Tom> the rest of the world thinks differently ;-)

It works the other way too ... people who come from the locking world
find it difficult to wrap their heads around MVCC. A big part of this
is because Gray's original paper on transaction isolation defines the
different levels based on what kind of lock acquisitions they involve.

A very nice alternative approach to defining transaction isolation is
"Generalized Isolation Level Definitions" by Adya, Liskov and O'Neil,
which appeared in ICDE 2000.

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh

#95Jeff Davis
jdavis-pgsql@empires.org
In reply to: Tom Lane (#93)
Re: Much Ado About COUNT(*)

Certainly not; ACID was a recognized goal long before anyone thought of
MVCC. You do need much more locking to make it work without MVCC,
though --- for instance, a reader that is interested in a just-modified
row has to block until the writer completes or rolls back.

People who hang around Postgres too long tend to think that MVCC is the
obviously correct way to do things, but much of the rest of the world
thinks differently ;-)

Well, that would explain why everyone is so happy with PostgreSQL's
concurrent access performance.

Thanks for the information, although I'm not sure I wanted to be
reminded about complicated locking issues (I suppose I must have known
that at one time, but perhaps I suppressed it ;-)

Regards,
Jeff Davis

#96Christopher Browne
cbbrowne@acm.org
In reply to: Jonah H. Harris (#24)
Re: Much Ado About COUNT(*)

Martha Stewart called it a Good Thing when jdavis-pgsql@empires.org (Jeff Davis) wrote:

I almost think to not supply an MVCC system would break the "I" in ACID,
would it not? I can't think of any other obvious way to isolate the
transactions, but on the other hand, wouldn't DB2 want to be ACID
compliant?

Wrong, wrong, wrong...

MVCC allows an ACID implementation to avoid a lot of resource
locking.

In the absence of MVCC, you have way more locks outstanding, which
makes it easier for there to be conflicts between lock requests.

In effect, with MVCC, you can do more things concurrently without the
system crumbling due to a surfeit of deadlocks.
--
"cbbrowne","@","gmail.com"
http://www3.sympatico.ca/cbbrowne/multiplexor.html
Why isn't phonetic spelled the way it sounds?

#97Mark Cave-Ayland
m.cave-ayland@webbased.co.uk
In reply to: Christopher Browne (#96)
Re: Much Ado About COUNT(*)

Date: Wed, 12 Jan 2005 18:45:09 -0800
From: Jeff Davis <jdavis-pgsql@empires.org>
To: Alvaro Herrera <alvherre@dcc.uchile.cl>
Cc: pgsql-hackers@postgresql.org
Subject: Re: Much Ado About COUNT(*)
Message-ID: <1105584309.2886.410.camel@jeff>

(cut)

Thanks for the link. It looks like it breaks it up into chunks of about
2KB. I think the conversation was mostly assuming the tables were somewhat
closer to the size of an index. If you have more than 2KB per tuple, pretty
much anything you do with an index would be faster I would think.

Hi Jeff/Alvaro,

I'm considering an application at the moment whereby I would need to do lots
of COUNT(*) on lots of separate tables without a WHERE clause. Would
something like the following help speed up the COUNT(*) by reducing the
tuple size being used for the count?

CREATE SEQUENCE id_seq;

CREATE TABLE person_count (
id int8
);

CREATE TABLE person (
id int8 DEFAULT nextval('id_seq'),
first_name text,
surname text,
age int,
address1 text,
address2 text,
address3 text,
address4 text,
postcode text,
tel text
);

For each insert:

BEGIN;
INSERT INTO person (first_name, .... tel) VALUES ('Fred', ....
'12345');
INSERT INTO person_count(id) VALUES (currval('id_seq'));
COMMIT;

So then I would use SELECT COUNT(*) FROM person_count whenever I wanted to
know the current number of person records. How much quicker would a COUNT(*)
be if visibility were included in the indices as opposed to a "hacked"
approach like this?

Many thanks,

Mark.

------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 791021
F: +44 (0)1752 791023
W: http://www.webbased.co.uk

#98Bruno Wolff III
bruno@wolff.to
In reply to: Mark Cave-Ayland (#97)
Re: Much Ado About COUNT(*)

On Wed, Jan 19, 2005 at 14:59:17 -0000,
Mark Cave-Ayland <m.cave-ayland@webbased.co.uk> wrote:

BEGIN;
INSERT INTO person (first_name, .... Tel) VALUES ('Fred', ....
'12345');
INSERT INTO person_count(id) VALUES (currval('id_seq'));
COMMIT;

So then I would use SELECT COUNT(*) FROM person_count whenever I wanted to
know the current number of person records. How much quicker would a COUNT(*)
be if visibility were included in the indices as opposed to a "hacked"
approach like this?

You are only going to get a constant factor speed up unless the space savings
allows much better use of cache. You probably want to look at using
triggers to maintain counts in another table.
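
A rough sketch of what Bruno suggests, using a single counter row kept up
to date by a row-level trigger on Mark's person table, might look like the
following (untested; the table, function and trigger names are illustrative,
not from the thread):

CREATE TABLE person_rowcount (n bigint NOT NULL);
INSERT INTO person_rowcount VALUES (0);

CREATE FUNCTION person_rowcount_fn() RETURNS trigger AS '
BEGIN
    IF TG_OP = ''INSERT'' THEN
        UPDATE person_rowcount SET n = n + 1;
    ELSIF TG_OP = ''DELETE'' THEN
        UPDATE person_rowcount SET n = n - 1;
    END IF;
    RETURN NULL;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER person_rowcount_trig AFTER INSERT OR DELETE ON person
    FOR EACH ROW EXECUTE PROCEDURE person_rowcount_fn();

-- Reading the count is then a one-row scan:
SELECT n FROM person_rowcount;

Note that every writer updates the same counter row, so concurrent
transactions serialise on it; the differences-list variants discussed
later in the thread avoid that.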

#99Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Bruno Wolff III (#98)
Re: Much Ado About COUNT(*)

On Wed, Jan 19, 2005 at 10:16:38AM -0600, Bruno Wolff III wrote:

On Wed, Jan 19, 2005 at 14:59:17 -0000,
Mark Cave-Ayland <m.cave-ayland@webbased.co.uk> wrote:

So then I would use SELECT COUNT(*) FROM person_count whenever I wanted to
know the current number of person records. How much quicker would a COUNT(*)
be if visibility were included in the indices as opposed to a "hacked"
approach like this?

You are only going to get a constant factor speed up unless the space savings
allows much better use of cache. You probably want to look at using
triggers to maintain counts in another table.

I'd try using a "start value" and a differences list. So the
differences list would be initially empty and the start value would be
0. On insert or delete, you create a new difference (with +1 or
whatever). Periodically, the differences would be added to the start
value and the records deleted. Thus the time to calculate the total
count is much smaller, and it follows MVCC rules. Of course there are
lots of minor details not mentioned here.

Not sure if I'd model this with a single table or two.

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"I would rather have GNU than GNOT." (ccchips, lwn.net/Articles/37595/)

#100Jeff Davis
jdavis-pgsql@empires.org
In reply to: Alvaro Herrera (#99)
Re: Much Ado About COUNT(*)

To fill in some details I think what he's saying is this:

=> create table foo(...);
=> create table foo_count(num int);
=> insert into foo_count values(0);
=> create table foo_change(num int);

then create a trigger "after delete on foo" that does "insert into
foo_change values(-1)" and a trigger "after insert on foo" that inserts
a +1 into foo_change.
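
A minimal sketch of those two triggers (untested; the function and trigger
names are illustrative, not from the original message):

CREATE FUNCTION foo_ins_fn() RETURNS trigger AS '
BEGIN
    INSERT INTO foo_change VALUES (1);
    RETURN NULL;
END;
' LANGUAGE plpgsql;

CREATE FUNCTION foo_del_fn() RETURNS trigger AS '
BEGIN
    INSERT INTO foo_change VALUES (-1);
    RETURN NULL;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER foo_ins AFTER INSERT ON foo
    FOR EACH ROW EXECUTE PROCEDURE foo_ins_fn();
CREATE TRIGGER foo_del AFTER DELETE ON foo
    FOR EACH ROW EXECUTE PROCEDURE foo_del_fn();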

Periodically, do:
=> begin;
=> set transaction isolation level serializable;
=> update foo_count set num=num+(select sum(num) from foo_change);
=> delete from foo_change;
=> commit;
=> VACUUM;

And then any time you need the correct count(*) value, do instead:
=> select sum(num) from (select num from foo_count union all select num
from foo_change) as t;

And that should work. I haven't tested this exact example, so I may have
overlooked something.

Hope that helps. That way, you don't have huge waste from the second
table, and also triggers maintain it for you and you don't need to think
about it.

Regards,
Jeff Davis

Show quoted text

On Wed, 2005-01-19 at 17:40 -0300, Alvaro Herrera wrote:

On Wed, Jan 19, 2005 at 10:16:38AM -0600, Bruno Wolff III wrote:

On Wed, Jan 19, 2005 at 14:59:17 -0000,
Mark Cave-Ayland <m.cave-ayland@webbased.co.uk> wrote:

So then I would use SELECT COUNT(*) FROM person_count whenever I wanted to
know the current number of person records. How much quicker would a COUNT(*)
be if visibility were included in the indices as opposed to a "hacked"
approach like this?

You are only going to get a constant factor speed up unless the space savings
allows much better use of cache. You probably want to look at using
triggers to maintain counts in another table.

I'd try using a "start value" and a differences list. So the
differences list would be initially empty and the start value would be
0. On insert or delete, you create a new difference (with +1 or
whatever). Periodically, the differences would be added to the start
value and the records deleted. Thus the time to calculate the total
count is much smaller, and it follows MVCC rules. Of course there are
lots of minor details not mentioned here.

Not sure if I'd model this with a single table or two.

#101Mark Cave-Ayland
m.cave-ayland@webbased.co.uk
In reply to: Jeff Davis (#100)
Re: Much Ado About COUNT(*)

-----Original Message-----
From: Jeff Davis [mailto:jdavis-pgsql@empires.org]
Sent: 19 January 2005 21:33
To: Alvaro Herrera
Cc: Mark Cave-Ayland; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Much Ado About COUNT(*)

To fill in some details I think what he's saying is this:

=> create table foo(...);
=> create table foo_count(num int);
=> insert into foo_count values(0);
=> create table foo_change(num int);

then create a trigger "after delete on foo" that does "insert
into foo_change values(-1)" and a trigger "after insert on
foo" that inserts a +1 into foo_change.

Periodically, do:
=> begin;
=> set transaction isolation level serializable;
=> update foo_count set num=num+(select sum(num) from
foo_change); => delete from foo_change; => commit; => VACUUM;

And then any time you need the correct count(*) value, do
instead: => select sum(num) from (select num from foo_count
union all select num from foo_change) as t;

And that should work. I haven't tested this exact example, so
I may have overlooked something.

Hope that helps. That way, you don't have huge waste from the
second table, and also triggers maintain it for you and you
don't need to think about it.

Regards,
Jeff Davis

Hi Jeff,

Thanks for the information. I seem to remember something similar to this
being discussed last year in a similar thread. My only real issue I can see
with this approach is that the trigger is fired for every row, and it is
likely that the database I am planning will have large inserts of several
hundred thousand records. Normally the impact of these is minimised by
inserting the entire set in one transaction. Is there any way that your
trigger can be modified to fire once per transaction with the number of
modified rows as a parameter?

Many thanks,

Mark.

------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 791021
F: +44 (0)1752 791023
W: http://www.webbased.co.uk

#102D'Arcy J.M. Cain
darcy@druid.net
In reply to: Mark Cave-Ayland (#101)
Re: Much Ado About COUNT(*)

On Thu, 20 Jan 2005 10:12:17 -0000
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> wrote:

Thanks for the information. I seem to remember something similar to
this being discussed last year in a similar thread. My only real issue
I can see with this approach is that the trigger is fired for every
row, and it is likely that the database I am planning will have large
inserts of several hundred thousand records. Normally the impact of
these is minimised by inserting the entire set in one transaction. Is
there any way that your trigger can be modified to fire once per
transaction with the number of modified rows as a parameter?

I don't believe that such a facility exists but before dismissing it you
should test it out. I think that you will find that disk buffering (the
system's as well as PostgreSQL's) will effectively handle this for you
anyway.

-- 
D'Arcy J.M. Cain <darcy@druid.net>         |  Democracy is three wolves
http://www.druid.net/darcy/                |  and a sheep voting on
+1 416 425 1212     (DoD#0082)    (eNTP)   |  what's for dinner.
#103Richard Huxton
dev@archonet.com
In reply to: D'Arcy J.M. Cain (#102)
Re: Much Ado About COUNT(*)

D'Arcy J.M. Cain wrote:

On Thu, 20 Jan 2005 10:12:17 -0000
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> wrote:

Thanks for the information. I seem to remember something similar to
this being discussed last year in a similar thread. My only real issue
I can see with this approach is that the trigger is fired for every
row, and it is likely that the database I am planning will have large
inserts of several hundred thousand records. Normally the impact of
these is minimised by inserting the entire set in one transaction. Is
there any way that your trigger can be modified to fire once per
transaction with the number of modified rows as a parameter?

I don't believe that such a facility exists but before dismissing it you
should test it out. I think that you will find that disk buffering (the
system's as well as PostgreSQL's) will effectively handle this for you
anyway.

Well, it looks like ROW_COUNT isn't set in a statement-level trigger
function (GET DIAGNOSTICS myvar=ROW_COUNT). Which is a shame, otherwise
it would be easy to handle. It should be possible to expose this
information though, since it gets reported at the command conclusion.

--
Richard Huxton
Archonet Ltd

-- stmt_trig_test.sql --
BEGIN;

CREATE TABLE trigtest (
a int4 NOT NULL,
b text,
PRIMARY KEY (a)
);

CREATE FUNCTION tt_test_fn() RETURNS TRIGGER AS '
DECLARE
    nr integer;
    ro integer;
    nr2 integer;
BEGIN
    GET DIAGNOSTICS nr = ROW_COUNT;
    GET DIAGNOSTICS ro = RESULT_OID;
    SELECT count(*) INTO nr2 FROM trigtest;

    RAISE NOTICE ''nr = % / ro = % / nr2 = %'', nr, ro, nr2;

    RETURN NULL;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER tt_test AFTER INSERT OR UPDATE ON trigtest
FOR EACH STATEMENT
EXECUTE PROCEDURE tt_test_fn();

INSERT INTO trigtest VALUES (1,'a');
INSERT INTO trigtest VALUES (2,'b');
UPDATE trigtest SET b = 'x';

ROLLBACK;

#104Mark Cave-Ayland
m.cave-ayland@webbased.co.uk
In reply to: Richard Huxton (#103)
Re: Much Ado About COUNT(*)

-----Original Message-----
From: Richard Huxton [mailto:dev@archonet.com]
Sent: 20 January 2005 12:45
To: D'Arcy J.M. Cain
Cc: Mark Cave-Ayland; jdavis-pgsql@empires.org;
alvherre@dcc.uchile.cl; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Much Ado About COUNT(*)

D'Arcy J.M. Cain wrote:

On Thu, 20 Jan 2005 10:12:17 -0000
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> wrote:

Thanks for the information. I seem to remember something similar to
this being discussed last year in a similar thread. My only

real issue

I can see with this approach is that the trigger is fired for every
row, and it is likely that the database I am planning will

have large

inserts of several hundred thousand records. Normally the impact of
these is minimised by inserting the entire set in one

transaction. Is

there any way that your trigger can be modified to fire once per
transaction with the number of modified rows as a parameter?

I don't believe that such a facility exists but before

dismissing it

you should test it out. I think that you will find that disk
buffering (the system's as well as PostgreSQL's) will effectively
handle this for you anyway.

Well, it looks like ROW_COUNT isn't set in a statement-level trigger
function (GET DIAGNOSTICS myvar=ROW_COUNT). Which is a shame,
otherwise
it would be easy to handle. It should be possible to expose this
information though, since it gets reported at the command conclusion.

Hi Richard,

This is more the sort of approach I would be looking for. However I think
even in a transaction with ROW_COUNT defined, the trigger will still be
called once per insert. I think something like this would require a new
syntax like below, and some supporting code that would keep track of the
tables touched by a transaction :(

CREATE TRIGGER tt_test AFTER TRANSACTION ON trigtest
FOR EACH TRANSACTION
EXECUTE PROCEDURE tt_test_fn();

I am sure that Jeff's approach will work; however, it just seems like writing
out one table entry per row is going to slow large bulk inserts right down.

Kind regards,

Mark.

------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 791021
F: +44 (0)1752 791023
W: http://www.webbased.co.uk

#105Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Mark Cave-Ayland (#104)
Re: Much Ado About COUNT(*)

On Thu, Jan 20, 2005 at 01:33:10PM -0000, Mark Cave-Ayland wrote:

I am sure that Jeff's approach will work, however it just seems like writing
out one table entry per row is going to slow large bulk inserts right down.

I don't see how it is any slower than the approach of inserting one
entry per row in the narrow table the OP wanted to use. And it will be
faster for the scans.

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
Officer Krupke, what are we to do?
Gee, officer Krupke, Krup you! (West Side Story, "Gee, Officer Krupke")

#106Richard Huxton
dev@archonet.com
In reply to: Mark Cave-Ayland (#104)
Re: Much Ado About COUNT(*)

Mark Cave-Ayland wrote:

-----Original Message-----
From: Richard Huxton [mailto:dev@archonet.com]
Sent: 20 January 2005 12:45
To: D'Arcy J.M. Cain
Cc: Mark Cave-Ayland; jdavis-pgsql@empires.org;
alvherre@dcc.uchile.cl; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Much Ado About COUNT(*)

D'Arcy J.M. Cain wrote:

On Thu, 20 Jan 2005 10:12:17 -0000
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> wrote:

Thanks for the information. I seem to remember something similar to
this being discussed last year in a similar thread. My only

real issue

I can see with this approach is that the trigger is fired for every
row, and it is likely that the database I am planning will

have large

inserts of several hundred thousand records. Normally the impact of
these is minimised by inserting the entire set in one

transaction. Is

there any way that your trigger can be modified to fire once per
transaction with the number of modified rows as a parameter?

I don't believe that such a facility exists but before

dismissing it

you should test it out. I think that you will find that disk
buffering (the system's as well as PostgreSQL's) will effectively
handle this for you anyway.

Well, it looks like ROW_COUNT isn't set in a statement-level trigger
function (GET DIAGNOSTICS myvar=ROW_COUNT). Which is a shame,
otherwise
it would be easy to handle. It should be possible to expose this
information though, since it gets reported at the command conclusion.

Hi Richard,

This is more the sort of approach I would be looking for. However I think
even in a transaction with ROW_COUNT defined, the trigger will still be
called once per insert. I think something like this would require a new
syntax like below, and some supporting code that would keep track of the
tables touched by a transaction :(

Well, a statement-level trigger would be called once per statement,
which can be much less than per row.
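
For example (a sketch building on Richard's earlier stmt_trig_test.sql; the
generate_series call is only there to produce a multi-row statement), a
single INSERT ... SELECT fires the statement-level trigger once, however
many rows it inserts:

-- one statement, many rows, but tt_test fires only once
INSERT INTO trigtest
    SELECT g, 'row ' || g::text FROM generate_series(3, 10000) AS g;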

--
Richard Huxton
Archonet Ltd

#107Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#56)
Re: Much Ado About COUNT(*)

Added to TODO based on this discussion:

---------------------------------------------------------------------------

* Speed up COUNT(*)

We could use a fixed row count and a +/- count to follow MVCC
visibility rules, or a single cached value could be used and
invalidated if anyone modifies the table. Another idea is to
get a count directly from a unique index, but for this to be
faster than a sequential scan it must avoid access to the heap
to obtain tuple visibility information.

* Allow data to be pulled directly from indexes

Currently indexes do not have enough tuple visibility
information to allow data to be pulled from the index without
also accessing the heap. One way to allow this is to set a bit
on index tuples to indicate if a tuple is currently visible to
all transactions when the first valid heap lookup happens. This
bit would have to be cleared when a heap tuple is expired.

---------------------------------------------------------------------------

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Ah, right, I missed the connection. Hmm ... that's sort of the inverse
of the "killed tuple" optimization we put in a release or two back,
where an index tuple is marked as definitely dead once it's committed
dead and the deletion is older than all active transactions.

Yes, it is sort of the reverse, but how do you get around the delete
case?

A would-be deleter of a tuple would have to go and clear the "known
good" bits on all the tuple's index entries before it could commit.
This would bring the tuple back into the "uncertain status" condition
where backends would have to visit the heap to find out what's up.
Eventually the state would become certain again (either dead to
everyone or live to everyone) and one or the other hint bit could be
set again.

The ugly part of this is that clearing the bit is not like setting a
hint bit, ie it's not okay if we lose that change. Therefore, each
bit-clearing would have to be WAL-logged. This is a big part of my
concern about the cost.

regards, tom lane

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#108Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#83)
Re: Much Ado About COUNT(*)

Tom Lane wrote:

Tom Lane <tgl@sss.pgh.pa.us> writes:

Manfred Koizar <mkoi-pg@aon.at> writes:

Last time we discussed this, didn't we come to the conclusion, that
resetting status bits is not a good idea because of possible race
conditions?

There's no race condition,

Actually, wait a minute --- you have a point. Consider a tuple whose
inserting transaction (A) has just dropped below GlobalXmin.
Transaction B is doing an index scan, so it's going to do something like

* Visit index entry, observe that it is in "uncertain" state.
* Visit heap tuple, observe that A has committed and is < GlobalXmin,
and there is no deleter.
* Return to index entry and mark it "visible to all".

Now suppose transaction C decides to delete the tuple. It will

* Insert itself as the XMAX of the heap tuple.
* Visit index entry, set state to "uncertain" if not already that way.

C could do this between steps 2 and 3 of B, in which case the index
entry ends up improperly marked "visible to all" while in fact a
deletion is pending. Ugh. We'd need some kind of interlock to prevent
this from happening, and it's not clear what. Might be tricky to create
such an interlock without introducing either deadlock or a big
performance penalty.

I am thinking we have to somehow lock the row while we set the index
status bit. We could add a new heap bit that says "my xid is going to
set the status bit" and put our xid in the expired location, set the
bit, then return to the heap and clear it.

Can we keep the heap and index page locked at the same time?

Anyway it is clearly something that could be an issue.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#109Ron Mayer
rm_pg@cheapcomplexdevices.com
In reply to: Bruce Momjian (#107)
Re: Much Ado About COUNT(*)

Bruce Momjian wrote:

Added to TODO based on this discusion:...
* Speed up COUNT(*)

One thing I think would help lots of people is if the
documentation near the COUNT aggregate explained some
of the techniques using triggers to maintain a count
for tables where this is important.

For every one person who reads the mailing list archives,
I bet there are dozens who merely read the product
documentation and never find the workaround/solution.

Perhaps a note below the table here:
http://www.postgresql.org/docs/8.0/interactive/functions-aggregate.html#FUNCTIONS-AGGREGATE-TABLE
would be a good place. If it is already somewhere in
the docs, perhaps the page I linked should refer to the
other page as well.

#110Jim C. Nasby
decibel@decibel.org
In reply to: Ron Mayer (#109)
Re: Much Ado About COUNT(*)

On Sun, Jan 23, 2005 at 12:36:10AM -0800, Ron Mayer wrote:

Bruce Momjian wrote:

Added to TODO based on this discusion:...
* Speed up COUNT(*)

One thing I think would help lots of people is if the
documentation near the COUNT aggregate explained some
of the techniques using triggers to maintain a count
for tables where this is important.

For every one person who reads the mailing list archives,
I bet there are dozens who merely read the product
documentation and never find the workaround/solution.

Perhaps a note below the table here:
http://www.postgresql.org/docs/8.0/interactive/functions-aggregate.html#FUNCTIONS-AGGREGATE-TABLE
would be a good place. If it is already somewhere in
the docs, perhaps the page I linked should refer to the
other page as well.

Does anyone have working code they could contribute? It would be best to
give at least an example in the docs. Even better would be something in
pgfoundry that helps build a summary table and the rules/triggers you
need to maintain it.
--
Jim C. Nasby, Database Consultant decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

#111Mark Kirkwood
markir@coretech.co.nz
In reply to: Jim C. Nasby (#110)
Re: Much Ado About COUNT(*)

Jim C. Nasby wrote:

Does anyone have working code they could contribute? It would be best to
give at least an example in the docs. Even better would be something in
pgfoundry that helps build a summary table and the rules/triggers you
need to maintain it.

http://developer.postgresql.org/docs/postgres/plpgsql-trigger.html#PLPGSQL-TRIGGER-SUMMARY-EXAMPLE

regards

Mark

#112Jonah H. Harris
jharris@tvi.edu
In reply to: Mark Kirkwood (#111)
Re: Much Ado About COUNT(*)

Here's a possible solution... though I'm not sure whether you'll find
the pg_ prefix appropriate in this context.

-- Create a Test Relation
CREATE TABLE test_tbl (
test_id BIGINT NOT NULL,
test_value VARCHAR(128) NOT NULL,
PRIMARY KEY (test_id));

-- Create COUNT Collector Relation
CREATE TABLE pg_user_table_counts (
schemaname VARCHAR(64) NOT NULL,
tablename VARCHAR(64) NOT NULL,
rowcount BIGINT NOT NULL DEFAULT 0,
PRIMARY KEY (schemaname, tablename));

-- Populate Collector Relation
INSERT INTO pg_user_table_counts (schemaname, tablename)
(SELECT
schemaname,
tablename
FROM
pg_tables
WHERE
schemaname != 'pg_catalog'
AND schemaname != 'information_schema'
AND tablename != 'pg_user_table_counts'
)
;

-- Create our Increment/Decrement Function
CREATE OR REPLACE FUNCTION pg_user_table_count_func () RETURNS TRIGGER
AS $pg_user_table_count_func$
DECLARE
    this_schemaname VARCHAR(64);
BEGIN
    SELECT INTO this_schemaname
        nspname
    FROM
        pg_namespace
    WHERE
        oid = (SELECT relnamespace FROM pg_class WHERE oid = TG_RELID);

    -- Decrement Count
    IF (TG_OP = 'DELETE') THEN

        UPDATE pg_user_table_counts
        SET rowcount = rowcount - 1
        WHERE schemaname = this_schemaname
        AND tablename = TG_RELNAME;

    ELSIF (TG_OP = 'INSERT') THEN

        UPDATE pg_user_table_counts
        SET rowcount = rowcount + 1
        WHERE schemaname = this_schemaname
        AND tablename = TG_RELNAME;

    END IF;
    RETURN NULL;
END;
$pg_user_table_count_func$ LANGUAGE plpgsql;

-- Create AFTER INSERT/DELETE Trigger on our Test Table
CREATE TRIGGER test_tbl_aidt
AFTER INSERT OR DELETE ON test_tbl
FOR EACH ROW EXECUTE PROCEDURE pg_user_table_count_func();

-- INSERT to Test Relation
INSERT INTO test_tbl VALUES (1, 'Demo INSERT');

-- Query Collector
demodb=# SELECT * FROM pg_user_table_counts;
 schemaname | tablename | rowcount
------------+-----------+----------
 public     | test_tbl  |        1
(1 row)

-- DELETE from Test Relation
DELETE FROM test_tbl;

-- Query Collector
demodb=# SELECT * FROM pg_user_table_counts;
 schemaname | tablename | rowcount
------------+-----------+----------
 public     | test_tbl  |        0
(1 row)

Mark Kirkwood wrote:

Show quoted text

Jim C. Nasby wrote:

Does anyone have working code they could contribute? It would be best to
give at least an example in the docs. Even better would be something in
pgfoundry that helps build a summary table and the rules/triggers you
need to maintain it.

http://developer.postgresql.org/docs/postgres/plpgsql-trigger.html#PLPGSQL-TRIGGER-SUMMARY-EXAMPLE

regards

Mark

#113Manfred Koizar
mkoi-pg@aon.at
In reply to: Jonah H. Harris (#112)
Re: Much Ado About COUNT(*)

On Mon, 24 Jan 2005 08:28:09 -0700, "Jonah H. Harris" <jharris@tvi.edu>
wrote:

UPDATE pg_user_table_counts
SET rowcount = rowcount + 1
WHERE schemaname = this_schemaname
AND tablename = TG_RELNAME;

This might work for small single user applications. You'll have to keep
an eye on dead tuples in pg_user_table_counts though.

But as soon as there are several concurrent transactions doing both
INSERTs and DELETEs, your solution will in the best case serialise
access to test_tbl or it will break down because of deadlocks.
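
To illustrate the serialisation Manfred describes, consider two sessions
running Jonah's example concurrently (a sketch; the session labels are only
for exposition):

-- Session 1
BEGIN;
INSERT INTO test_tbl VALUES (1, 'first');
-- the trigger's UPDATE on pg_user_table_counts takes a row lock on the
-- single counter row for public.test_tbl

-- Session 2 (concurrent)
BEGIN;
INSERT INTO test_tbl VALUES (2, 'second');
-- the trigger's UPDATE blocks on that same counter row until session 1
-- commits or rolls back, so all writers to test_tbl serialise here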

Servus
Manfred

#114Christopher Browne
cbbrowne@acm.org
In reply to: Tom Lane (#56)
Re: Much Ado About COUNT(*)

Centuries ago, Nostradamus foresaw when mkoi-pg@aon.at (Manfred Koizar) would write:

On Mon, 24 Jan 2005 08:28:09 -0700, "Jonah H. Harris" <jharris@tvi.edu>
wrote:

UPDATE pg_user_table_counts
SET rowcount = rowcount + 1
WHERE schemaname = this_schemaname
AND tablename = TG_RELNAME;

This might work for small single user applications. You'll have to keep
an eye on dead tuples in pg_user_table_counts though.

But as soon as there are several concurrent transactions doing both
INSERTs and DELETEs, your solution will in the best case serialise
access to test_tbl or it will break down because of deadlocks.

At that point, what you need to do is to break the process in three:

1. Instead of the above, use...

insert into pg_user_table_counts (rowcount, schemaname,
tablename) values (1, this_schemaname, TG_RELNAME);

The process for DELETEs involves using the value -1, of course (see the
sketch after step 3)...

2. A process needs to run once in a while that does...

create temp table new_counts as
select schemaname, tablename, sum(rowcount) as rowcount
from pg_user_table_counts group by schemaname, tablename;
delete from pg_user_table_counts;
insert into pg_user_table_counts (schemaname, tablename, rowcount)
select schemaname, tablename, rowcount from new_counts;

This process "compresses" the table so that it becomes cheaper to
do the aggregate in 3.

3. Querying values is done differently...

select sum(rowcount) from pg_user_table_counts where schemaname =
'this' and tablename = 'that';
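
A sketch (untested) of Jonah's pg_user_table_count_func rewritten along the
lines of step 1; note that the PRIMARY KEY on (schemaname, tablename) in the
collector table would have to be dropped, since it now accumulates many rows
per table:

CREATE OR REPLACE FUNCTION pg_user_table_count_func () RETURNS TRIGGER
AS $pg_user_table_count_func$
DECLARE
    this_schemaname VARCHAR(64);
BEGIN
    SELECT INTO this_schemaname
        nspname
    FROM
        pg_namespace
    WHERE
        oid = (SELECT relnamespace FROM pg_class WHERE oid = TG_RELID);

    IF (TG_OP = 'DELETE') THEN
        INSERT INTO pg_user_table_counts (rowcount, schemaname, tablename)
            VALUES (-1, this_schemaname, TG_RELNAME);
    ELSIF (TG_OP = 'INSERT') THEN
        INSERT INTO pg_user_table_counts (rowcount, schemaname, tablename)
            VALUES (1, this_schemaname, TG_RELNAME);
    END IF;

    RETURN NULL;
END;
$pg_user_table_count_func$ LANGUAGE plpgsql;
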
--
let name="cbbrowne" and tld="ntlug.org" in String.concat "@" [name;tld];;
http://www.ntlug.org/~cbbrowne/nonrdbms.html
Rules of the Evil Overlord #118. "If I have equipment which performs
an important function, it will not be activated by a lever that
someone could trigger by accidentally falling on when fatally
wounded." <http://www.eviloverlord.com/>