[RFC] speed up count(*)
Hi,
Perennially our users have complaints about slow count(*) when coming from
some other systems. Index-only scans help, but I think we can do better. I
recently wondered if a BRIN index could be used to answer min/max aggregate
queries over the whole table, and it turns out it can't. However, then it
occurred to me that if we had an opclass that keeps track of the count in
each page range, that would be a way to do a fast count(*) by creating the
right index. That would require planner support and other work, but it
seems doable. Any opinions on whether this is worth the effort?
--
John Naylor
EDB: http://www.enterprisedb.com
John Naylor <john.naylor@enterprisedb.com> writes:
Perennially our users have complaints about slow count(*) when coming from
some other systems. Index-only scans help, but I think we can do better. I
recently wondered if a BRIN index could be used to answer min/max aggregate
queries over the whole table, and it turns out it can't. However, then it
occurred to me that if we had an opclass that keeps track of the count in
each page range, that would be a way to do a fast count(*) by creating the
right index. That would require planner support and other work, but it
seems doable. Any opinions on whether this is worth the effort?
The core reason why this is hard is that we insist on giving the right
answer. In particular, count(*) is supposed to count the rows that
satisfy the asker's snapshot. So I don't see a good way to answer it
from an index only, given that we don't track visibility accurately
in indexes.
regards, tom lane
Hi,
On October 20, 2021 10:57:50 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
John Naylor <john.naylor@enterprisedb.com> writes:
Perennially our users have complaints about slow count(*) when coming from
some other systems. Index-only scans help, but I think we can do better. I
recently wondered if a BRIN index could be used to answer min/max aggregate
queries over the whole table, and it turns out it can't. However, then it
occurred to me that if we had an opclass that keeps track of the count in
each page range, that would be a way to do a fast count(*) by creating the
right index. That would require planner support and other work, but it
seems doable. Any opinions on whether this is worth the effort?
The core reason why this is hard is that we insist on giving the right
answer. In particular, count(*) is supposed to count the rows that
satisfy the asker's snapshot. So I don't see a good way to answer it
from an index only, given that we don't track visibility accurately
in indexes.
Yeah.
If we really wanted to, we could accelerate unqualified count(*) substantially by computing the count inside heapam. There's a *lot* of overhead associated with returning tuples, grouping them, etc. Especially with all_visible set, that's bound to be way faster (I'd guess at least 3-5x) if done in heapam (like we do the visibility determinations in heapgetpage for all tuples on a page at once).
But it's doubtful the necessary infrastructure is worth it. Perhaps that changes with the infrastructure some columnar AMs are asking for. They have a need to push more stuff down to the AM that's more generic than just count(*).
Regards,
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On 10/20/21 19:57, Tom Lane wrote:
John Naylor <john.naylor@enterprisedb.com> writes:
Perennially our users have complaints about slow count(*) when coming from
some other systems. Index-only scans help, but I think we can do better. I
recently wondered if a BRIN index could be used to answer min/max aggregate
queries over the whole table, and it turns out it can't. However, then it
occurred to me that if we had an opclass that keeps track of the count in
each page range, that would be a way to do a fast count(*) by creating the
right index. That would require planner support and other work, but it
seems doable. Any opinions on whether this is worth the effort?
The core reason why this is hard is that we insist on giving the right
answer. In particular, count(*) is supposed to count the rows that
satisfy the asker's snapshot. So I don't see a good way to answer it
from an index only, given that we don't track visibility accurately
in indexes.
Couldn't we simply inspect the visibility map, use the index data only
for fully visible/summarized ranges, and inspect the heap for the
remaining pages? That'd still be a huge improvement for tables where only
a few pages were modified recently, which is a pretty common case.
I think the bigger issue is that people rarely do COUNT(*) on the whole
table. There are usually other conditions and/or GROUP BY, and I'm not
sure how that would work.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
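As a rough diagnostic for how much of a table such a hybrid approach could
cover, the existing pg_visibility extension can report the per-page
visibility map bits ('foo' is a placeholder table name; this is only an
illustration of the all-visible fraction, not the proposed mechanism):

```sql
-- Sketch: what fraction of a table's pages are marked all-visible, i.e.
-- could be answered from summarized counts without visiting the heap?
CREATE EXTENSION IF NOT EXISTS pg_visibility;

SELECT count(*) FILTER (WHERE all_visible) AS all_visible_pages,
       count(*) AS total_pages
FROM pg_visibility_map('foo');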
On Wed, Oct 20, 2021 at 2:23 PM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
Couldn't we simply inspect the visibility map, use the index data only
for fully visible/summarized ranges, and inspect the heap for the
remaining pages? That'd still be a huge improvement for tables where only
a few pages were modified recently, which is a pretty common case.
I think the bigger issue is that people rarely do COUNT(*) on the whole
table. There are usually other conditions and/or GROUP BY, and I'm not
sure how that would work.
Right. My (possibly hazy) recollection is that people don't have quite as
high an expectation for queries with more complex predicates and/or
grouping. It would be interesting to see what the balance is.
--
John Naylor
EDB: http://www.enterprisedb.com
On 10/20/21 20:33, John Naylor wrote:
On Wed, Oct 20, 2021 at 2:23 PM Tomas Vondra
<tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>>
wrote:
Couldn't we simply inspect the visibility map, use the index data only
for fully visible/summarized ranges, and inspect the heap for the
remaining pages? That'd still be a huge improvement for tables where only
a few pages were modified recently, which is a pretty common case.
I think the bigger issue is that people rarely do COUNT(*) on the whole
table. There are usually other conditions and/or GROUP BY, and I'm not
sure how that would work.
Right. My (possibly hazy) recollection is that people don't have quite
as high an expectation for queries with more complex predicates and/or
grouping. It would be interesting to see what the balance is.
I don't know where the balance is, and I doubt it's possible to answer
that in general - I'm sure some workloads might benefit significantly.
I wonder if multi-column BRIN indexes would help in cases with
additional predicates. Seems possible.
BTW you mentioned using BRIN indexes for min/max - I've been thinking
about using BRIN indexes for ordering/sorting, which seems related. And
I think it's actually doable, so I wonder why you concluded using BRIN
indexes for min/max is not possible?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Oct 20, 2021 at 2:41 PM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
BTW you mentioned using BRIN indexes for min/max - I've been thinking
about using BRIN indexes for ordering/sorting, which seems related. And
I think it's actually doable, so I wonder why you concluded using BRIN
indexes for min/max is not possible?
I just gathered it was not implemented in the planner, going by the EXPLAIN
plan of a toy query I tried, and then I got the lightbulb moment about
count(*).
--
John Naylor
EDB: http://www.enterprisedb.com
On 10/20/21 2:33 PM, John Naylor wrote:
On Wed, Oct 20, 2021 at 2:23 PM Tomas Vondra
<tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>>
wrote:
Couldn't we simply inspect the visibility map, use the index data only
for fully visible/summarized ranges, and inspect the heap for the
remaining pages? That'd still be a huge improvement for tables where only
a few pages were modified recently, which is a pretty common case.
I think the bigger issue is that people rarely do COUNT(*) on the whole
table. There are usually other conditions and/or GROUP BY, and I'm not
sure how that would work.
Right. My (possibly hazy) recollection is that people don't have quite
as high an expectation for queries with more complex predicates and/or
grouping. It would be interesting to see what the balance is.
I think you are exactly correct. People seem to understand that with a
predicate it is harder, but they expect
select count(*) from foo;
to be nearly instantaneous, and they don't really need it to be exact.
The stock answer for that has been to do
select reltuples from pg_class
where relname = 'foo';
But that is unsatisfying because the problem is often with some
benchmark or another that cannot be changed.
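For what it's worth, a slightly more robust form of that stock answer writes
the lookup against the table's OID, which is unambiguous across schemas
('foo' is a placeholder name; reltuples is only as fresh as the last
VACUUM/ANALYZE, and is -1 on a never-yet-vacuumed table in recent releases):

```sql
-- Sketch of the same approximation, keyed by regclass rather than relname.
SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE oid = 'foo'::regclass;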
I'm sure this idea will be shot down in flames <donning flameproof
suit>, but what if we had a default "off" GUC which could be turned on
causing the former to be transparently rewritten into the latter
</donning flameproof suit>?
Joe
--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
On Thu, Oct 21, 2021 at 9:09 AM Joe Conway <mail@joeconway.com> wrote:
I think you are exactly correct. People seem to understand that with a
predicate it is harder, but they expect
select count(*) from foo;
to be nearly instantaneous, and they don't really need it to be exact.
The stock answer for that has been to do
select reltuples from pg_class
where relname = 'foo';
But that is unsatisfying because the problem is often with some
benchmark or another that cannot be changed.
I would also expect it to almost always give the wrong answer.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 10/21/21 4:06 PM, Robert Haas wrote:
On Thu, Oct 21, 2021 at 9:09 AM Joe Conway <mail@joeconway.com> wrote:
I think you are exactly correct. People seem to understand that with a
predicate it is harder, but they expect
select count(*) from foo;
to be nearly instantaneous, and they don't really need it to be exact.
The stock answer for that has been to do
select reltuples from pg_class
where relname = 'foo';
But that is unsatisfying because the problem is often with some
benchmark or another that cannot be changed.
I would also expect it to almost always give the wrong answer.
That is a grossly overstated position. When I have looked, it is often
not that terribly far off. And for many use cases that I have heard of
at least, quite adequate.
Joe
--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
On Thu, Oct 21, 2021 at 4:19 PM Joe Conway <mail@joeconway.com> wrote:
That is a grossly overstated position. When I have looked, it is often
not that terribly far off. And for many use cases that I have heard of
at least, quite adequate.
I don't think it's grossly overstated. If you need an approximation it
may be good enough, but I don't think it will often be exactly correct
- probably only if the table is small and rarely modified.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 10/21/21 4:23 PM, Robert Haas wrote:
On Thu, Oct 21, 2021 at 4:19 PM Joe Conway <mail@joeconway.com> wrote:
That is a grossly overstated position. When I have looked, it is often
not that terribly far off. And for many use cases that I have heard of
at least, quite adequate.
I don't think it's grossly overstated. If you need an approximation it
may be good enough, but I don't think it will often be exactly correct
- probably only if the table is small and rarely modified.
meh -- the people who expect this to be impossibly fast don't typically
need or expect it to be exactly correct, and there is no way to make it
"exactly correct" in someone's snapshot without doing all the work.
That is why I didn't suggest making it the default. If you flip the
switch, you would get a very fast approximation. If you care about
accuracy, you accept it has to be slow.
Joe
--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
On 10/21/21 4:29 PM, Joe Conway wrote:
On 10/21/21 4:23 PM, Robert Haas wrote:
On Thu, Oct 21, 2021 at 4:19 PM Joe Conway <mail@joeconway.com> wrote:
That is a grossly overstated position. When I have looked, it is often
not that terribly far off. And for many use cases that I have heard of
at least, quite adequate.
I don't think it's grossly overstated. If you need an approximation it
may be good enough, but I don't think it will often be exactly correct
- probably only if the table is small and rarely modified.
meh -- the people who expect this to be impossibly fast don't
typically need or expect it to be exactly correct, and there is no way
to make it "exactly correct" in someone's snapshot without doing all
the work.
That is why I didn't suggest making it the default. If you flip the
switch, you would get a very fast approximation. If you care about
accuracy, you accept it has to be slow.
I don't think we really want a switch for "inaccurate results
acceptable", and I doubt the standard would accept an approximation for
count(*).
But something else that gave a fast approximate answer
("count_estimate(*)"?) would be useful to many. Not portable but still
useful, if someone could come up with a reasonable implementation.
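One userspace approximation along those lines is the well-known trick of
asking the planner itself for its row estimate; a rough sketch, not the
actual proposal, and only as accurate as the planner's statistics:

```sql
-- Sketch of a count_estimate() that parses EXPLAIN (FORMAT JSON) output to
-- return the planner's estimated row count for an arbitrary query.
CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS bigint AS
$$
DECLARE
    plan json;
BEGIN
    EXECUTE 'EXPLAIN (FORMAT JSON) ' || query INTO plan;
    RETURN (plan->0->'Plan'->>'Plan Rows')::bigint;
END;
$$ LANGUAGE plpgsql;

-- Usage: SELECT count_estimate('SELECT * FROM foo');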
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Thu, Oct 21, 2021 at 4:29 PM Joe Conway <mail@joeconway.com> wrote:
meh -- the people who expect this to be impossibly fast don't typically
need or expect it to be exactly correct, and there is no way to make it
"exactly correct" in someone's snapshot without doing all the work.
I think it could actually be WAY faster than it is if, as Andres says,
we had the ability to push the count operation inside the heap AM. I
believe we have a tendency to attribute complaints like this to people
having unreasonable expectations, but here I'm not sure the expectation
is unreasonable. I vaguely recall writing special-purpose code to
count the number of tuples in a relation years ago, and IIRC it was
blazingly fast compared to letting our executor do it. I agree,
however, that an approximation can be faster still.
That is why I didn't suggest making it the default. If you flip the
switch, you would get a very fast approximation. If you care about
accuracy, you accept it has to be slow.
I'm not really here to take a position on the proposal. It doesn't
excite me, because I have not run across any users in the combination
of circumstances you mention: query can't be changed, exact answer not
actually required, whole table being counted. But I am not here to
call you a liar either. If you run across users in that situation all
the time, then you do.
--
Robert Haas
EDB: http://www.enterprisedb.com