GSoC 2015 proposal. Bitmap Index-only Count

Started by Anastasia Lubennikovaover 11 years ago5 messageshackers

a.lubennikova@postgrespro.ru

over 11 years ago

Hi, hackers!
Here is the text of my proposal which I've applied to GSoC.
(and link
http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/lubennikovaav/5657382461898752
)
Any suggestions and comments are welcome.

*Project name*

Bitmap Index-only Count

*Brief review*

There is a problem of slow counting in PostgreSQL [1]. The reason why this
is slow is related to the *MVCC* implementation in PostgreSQL. Index-only
scans (implemented since PostgreSQL-9.2) providing some performance
improvements where the *visibility map* of the table allows it. That’s
good. But it works only for access methods which provide amgettuple method.
Unfortunately GIN supports only BitmapIndexScan and has no implementation
of index_getnext() interface [2]. Idea of the proposal is to implement
Bitmap Index-Only Count method that allows to count elements, without
reference to the heap.

*Benefits to the PostgreSQL Community*

Faster count(*) would be actual for GIN. Probably it would accelerate
count(*) for other access methods too, but I do not sure if it would be
significant difference.

*Quantifiable results*

Acceleration of count(*) queries with WHERE clause which use GIN.

*Project details*

Let’s look at count(*) query:

EXPLAIN SELECT count(*) from tbl_mytbl where val @> '{"b":2}';

Now the query plan looks like this:

Aggregate

-> Bitmap Heap Scan on tbl_mytbl

Recheck Cond: (val @> '{"b": 2}'::jsonb)

-> Bitmap Index Scan on idx_mytbl

Index Cond: (val @> '{"b": 2}'::jsonb)

Idea of the proposal is to implement Bitmap Index-Only Count method that
allows to count elements, without reference to the heap.

Conditions:

- all tuples are visible (the same problem as for Index-only count(*));
- result TIDBitmap is not lossy. nchunks = 0;

int nchunks
<http://doxygen.postgresql.org/structTIDBitmap.html#a85d5883056bad6863cb47a15446581c7>;
/* number of lossy entries in pagetable */

- pages in TIDBitmap don’t require recheck

* recheck is used only on exact pages --- it indicates that although

* only the stated tuples need be checked, the full index qual condition

* must be checked for each (ie, these are candidate matches).

If all conditions are true, it’s possible to call aggregate COUNT function
for each tuple from TIDBitmap returned by Bitmap Index Scan (or
BitmapAnd/BitmapOr nodes). And there’s no necessity to call Bitmap Heap
Scan.

As a GSoC student I will create new Node “Bitmap Index-Only Scan”, which
would catch tuples from Bitmap Index Scan node and pass them to Aggregate
node. Thus, new query plan will be as follow:

Aggregate

-> Bitmap Index-only Count

-> Bitmap Index Scan on idx_mytbl

Index Cond: (val @> '{"b": 2}'::jsonb)

*Project Schedule*

until May 25

Read documentation and source code, clarify details of implementation.

until June 30

Implement Bitmap Index-only Count node.

1 July – 31 July

Add Bitmap Index-only Count node to Planner.

1 August -15 August

Final refactoring, testing and submitting a patch.

*Links*

1. https://wiki.postgresql.org/wiki/Slow_Counting
2.
http://postgresql.nabble.com/Todo-item-Support-amgettuple-in-GIN-td5780810.html

--
Best regards,
Lubennikova Anastasia

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Anastasia Lubennikova (#1)

Re: GSoC 2015 proposal. Bitmap Index-only Count

Anastasia Lubennikova <lubennikovaav@gmail.com> writes:

There is a problem of slow counting in PostgreSQL [1]. The reason why this
is slow is related to the *MVCC* implementation in PostgreSQL. Index-only
scans (implemented since PostgreSQL-9.2) providing some performance
improvements where the *visibility map* of the table allows it. That’s
good. But it works only for access methods which provide amgettuple method.
Unfortunately GIN supports only BitmapIndexScan and has no implementation
of index_getnext() interface [2].

Right ...

As a GSoC student I will create new Node “Bitmap Index-Only Scan”, which
would catch tuples from Bitmap Index Scan node and pass them to Aggregate
node. Thus, new query plan will be as follow:

I'm pretty hesitant about adding a whole new plan node type (which will
require quite a lot of infrastructure) for such a narrow use-case.
I think the odds are good that if you proceed down this path, you will
end up with something that never gets committed to Postgres.

I wonder whether it'd be possible to teach GIN to support index_getnext
instead. Initially it would probably work only for cases where the
index didn't have to return any columns ... but if we did it, maybe the
door would be open to cases where GIN could reconstruct actual values.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

over 11 years ago

In reply to: Tom Lane (#2)

Re: GSoC 2015 proposal. Bitmap Index-only Count

2015-03-24 18:01 GMT+04:00 Tom Lane <tgl@sss.pgh.pa.us>:

Anastasia Lubennikova <lubennikovaav@gmail.com> writes:

There is a problem of slow counting in PostgreSQL [1]. The reason why

this

is slow is related to the *MVCC* implementation in PostgreSQL. Index-only
scans (implemented since PostgreSQL-9.2) providing some performance
improvements where the *visibility map* of the table allows it. That’s
good. But it works only for access methods which provide amgettuple

method.

Unfortunately GIN supports only BitmapIndexScan and has no implementation
of index_getnext() interface [2].

Right ...

As a GSoC student I will create new Node “Bitmap Index-Only Scan”, which
would catch tuples from Bitmap Index Scan node and pass them to Aggregate
node. Thus, new query plan will be as follow:

I'm pretty hesitant about adding a whole new plan node type (which will
require quite a lot of infrastructure) for such a narrow use-case.
I think the odds are good that if you proceed down this path, you will
end up with something that never gets committed to Postgres.

Thanks a lot for reply. It was just approximate idea. I thought is wasn't
very good.

I wonder whether it'd be possible to teach GIN to support index_getnext

instead. Initially it would probably work only for cases where the
index didn't have to return any columns ... but if we did it, maybe the
door would be open to cases where GIN could reconstruct actual values.

Another idea is to write index_getnext() for GIN which would return some

fake tuple. Since there is no difference for COUNT aggregate what the tuple
contains. COUNT just wants to know whether we have tuple that satisfy the
qual.
Is this idea better? Is it possible for planner to use index_getnext() for
GIN only with COUNT aggregate?

--
Best regards,
Lubennikova Anastasia

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Anastasia Lubennikova (#3)

Re: GSoC 2015 proposal. Bitmap Index-only Count

Anastasia Lubennikova <lubennikovaav@gmail.com> writes:

2015-03-24 18:01 GMT+04:00 Tom Lane <tgl@sss.pgh.pa.us>:

I wonder whether it'd be possible to teach GIN to support index_getnext
instead. Initially it would probably work only for cases where the
index didn't have to return any columns ... but if we did it, maybe the
door would be open to cases where GIN could reconstruct actual values.

Another idea is to write index_getnext() for GIN which would return some
fake tuple. Since there is no difference for COUNT aggregate what the tuple
contains. COUNT just wants to know whether we have tuple that satisfy the
qual.

Well, yeah, that would be the idea (at least initially). You don't have
to return any real data unless you claim you can do so via amcanreturn.
The planner is still capable of selecting an index-only scan as long as
the query retrieves no columns.

The trick would be to not return the same heap TID more than once per
scan. A zero-order implementation would be to construct the same bitmap
we do now and then just provide a gingetnext function that scans through
that. That would be pretty awful in terms of scan startup time, so doing
better would be nice; but perhaps it would be useful even in that form.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Alexander Korotkov

aekorotkov@gmail.com

over 11 years ago

In reply to: Tom Lane (#4)

Re: GSoC 2015 proposal. Bitmap Index-only Count

On Wed, Mar 25, 2015 at 11:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Anastasia Lubennikova <lubennikovaav@gmail.com> writes:

2015-03-24 18:01 GMT+04:00 Tom Lane <tgl@sss.pgh.pa.us>:

I wonder whether it'd be possible to teach GIN to support index_getnext
instead. Initially it would probably work only for cases where the
index didn't have to return any columns ... but if we did it, maybe the
door would be open to cases where GIN could reconstruct actual values.

Another idea is to write index_getnext() for GIN which would return some
fake tuple. Since there is no difference for COUNT aggregate what the

tuple

contains. COUNT just wants to know whether we have tuple that satisfy the
qual.

Well, yeah, that would be the idea (at least initially). You don't have
to return any real data unless you claim you can do so via amcanreturn.
The planner is still capable of selecting an index-only scan as long as
the query retrieves no columns.

The trick would be to not return the same heap TID more than once per
scan. A zero-order implementation would be to construct the same bitmap
we do now and then just provide a gingetnext function that scans through
that. That would be pretty awful in terms of scan startup time, so doing
better would be nice; but perhaps it would be useful even in that form.

My ideal picture for FTS using GIN looks like this:

1) Have lexemes offsets in GIN stored with item pointers.
2) Calculate relevance using only GIN information without using heap.
3) Sort results by relevance in either GIN itself or executor node.
4) Get both TOP-N most relevant rows and total rows count (using index-only
scan) from single GIN index scan.

Implementing index_getnext() for GIN looks step forward for me because it
allows "index only count" and potentially could be used for ordered output.
However, it's unclear for me if it's feasible to have #4? Could we return
TOP-N results and total count from single GIN index scan?

------
With best regards,
Alexander Korotkov.