Index-only-scans, indexam API changes
At the moment, amgettuple only returns pointers to heap tuples. There is
no way to return data from the index tuples. That needs to be changed to
support index-only scans.
I propose that we split index_getnext into two steps: fetching the next
match from the index (index_next()), and fetching the corresponding heap
tuple (index_fetch()). Patch attached. There is still a shorthand
index_getnext() that is compatible with the old function, but it now
just calls index_next()+index_fetch().
The new index_fetch() function can only fetch one match from a HOT
chain. That greatly simplifies the code in indexam.c. The only caller
needing to fetch more than one tuple from a HOT chain (= using
SnapshotAny) is CLUSTER, so I moved the HOT-chain following logic into
cluster.c, with small changes to heap_hot_search_buffer().
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
split-index_gettuple-1.patchtext/x-diff; name=split-index_gettuple-1.patchDownload+308-428
On Mon, Jul 13, 2009 at 8:19 AM, Heikki
Linnakangas<heikki.linnakangas@enterprisedb.com> wrote:
I propose that we split index_getnext into two steps: fetching the next
match from the index (index_next()), and fetching the corresponding heap
tuple (index_fetch()).
A pretty trivial concern, but it seems confusing that the function to
fetch a tuple from the heap is called "index_fetch()"... Perhaps
something more explicit like index_fetch_heap_tuple() would be
clearer?
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
At the moment, amgettuple only returns pointers to heap tuples. There is
no way to return data from the index tuples. That needs to be changed to
support index-only scans.
What are you going to do for index types that don't store the original
data (e.g. hash)?
regards, tom lane
Tom Lane wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
At the moment, amgettuple only returns pointers to heap tuples. There is
no way to return data from the index tuples. That needs to be changed to
support index-only scans.What are you going to do for index types that don't store the original
data (e.g. hash)?
They will obviously not be able to regurgitate index tuples. I have not
yet decided how that's going to be signaled. In the prototype patch I
have, I have hard-coded that only b-trees can do that. A new column
"amcanreturntuples" column in pg_am seems like the most straightforward way.
(This indexam API patch isn't bothered with that yet. It just splits
index_gettuple() into two)
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
Tom Lane wrote:
What are you going to do for index types that don't store the original
data (e.g. hash)?
They will obviously not be able to regurgitate index tuples. I have not
yet decided how that's going to be signaled.
Well, I think that's a pretty critical part of the API change.
(This indexam API patch isn't bothered with that yet. It just splits
index_gettuple() into two)
One thought here is that an AM call isn't really free, and doing two of
them instead of one mightn't be such a good idea. I would suggest
either having a separate AM entry point to get both bits of data
("amgettupledata"?) or adding an optional parameter to amgettuple.
A small attraction of the alternate entry point is that then you can
flag the "unsupported" case by putting a zero in that pgam column, as
indeed we already do for amgettuple; so you don't need an additional
bool column.
[ thinks a bit ... ] At least for GIST, it is possible that whether
data can be regurgitated will vary depending on the selected opclass.
Some opclasses use the STORAGE modifier and some don't. I am not sure
how hard we want to work to support flexibility there. Would it be
sufficient to hard-code the check as "pgam says the AM can do it,
and the opclass does not have a STORAGE property"? Or do we need
additional intelligence about GIST opclasses?
regards, tom lane
Tom Lane wrote:
One thought here is that an AM call isn't really free, and doing two of
them instead of one mightn't be such a good idea. I would suggest
either having a separate AM entry point to get both bits of data
("amgettupledata"?) or adding an optional parameter to amgettuple.
I'm thinking of adding a new flag to IndexScanDesc, "needIndexTuple".
When that is set, amgettuple stores a pointer to the IndexTuple in
another new field in IndexScanDesc, xs_itup. (In case of b-tree at
least, the IndexTuple is a palloc'd copy of the tuple on disk)
[ thinks a bit ... ] At least for GIST, it is possible that whether
data can be regurgitated will vary depending on the selected opclass.
Some opclasses use the STORAGE modifier and some don't. I am not sure
how hard we want to work to support flexibility there. Would it be
sufficient to hard-code the check as "pgam says the AM can do it,
and the opclass does not have a STORAGE property"? Or do we need
additional intelligence about GIST opclasses?
Well, the way I have it implemented is that the am is not required to
return the index tuple, even if requested. I implemented the B-tree
changes similar to how we implement "kill_prior_tuple": in btgettuple,
lock the index page and see if the tuple is still at the same position
that we remembered in the scan opaque struct. If not (because of
concurrent changes to the page), we give up and don't return the index
tuple. The executor will then perform a heap fetch as before.
Alternatively, we could copy all the matching index tuples to private
memory when we step to a new index page, but that seems pretty expensive
if we're only going to use the first few matching tuples (LIMIT), and I
fear the memory management gets complicated. But in any case, the GiST
issue would still be there.
Since we're discussing it, I'm attaching the prototype patch I have for
returning tuples from b-tree and using them to filter rows before heap
fetches. I was going to post it in the morning along with description
about the planner and executor changes, but here goes. It applies on top
of the indexam API patch I started this thread with.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
indexfilter-1.patchtext/x-diff; name=indexfilter-1.patchDownload+589-135
On Mon, Jul 13, 2009 at 11:32 AM, Heikki
Linnakangas<heikki.linnakangas@enterprisedb.com> wrote:
Tom Lane wrote:
One thought here is that an AM call isn't really free, and doing two of
them instead of one mightn't be such a good idea. I would suggest
either having a separate AM entry point to get both bits of data
("amgettupledata"?) or adding an optional parameter to amgettuple.I'm thinking of adding a new flag to IndexScanDesc, "needIndexTuple".
When that is set, amgettuple stores a pointer to the IndexTuple in
another new field in IndexScanDesc, xs_itup. (In case of b-tree at
least, the IndexTuple is a palloc'd copy of the tuple on disk)[ thinks a bit ... ] At least for GIST, it is possible that whether
data can be regurgitated will vary depending on the selected opclass.
Some opclasses use the STORAGE modifier and some don't. I am not sure
how hard we want to work to support flexibility there. Would it be
sufficient to hard-code the check as "pgam says the AM can do it,
and the opclass does not have a STORAGE property"? Or do we need
additional intelligence about GIST opclasses?Well, the way I have it implemented is that the am is not required to
return the index tuple, even if requested. I implemented the B-tree
changes similar to how we implement "kill_prior_tuple": in btgettuple,
lock the index page and see if the tuple is still at the same position
that we remembered in the scan opaque struct. If not (because of
concurrent changes to the page), we give up and don't return the index
tuple. The executor will then perform a heap fetch as before.Alternatively, we could copy all the matching index tuples to private
memory when we step to a new index page, but that seems pretty expensive
if we're only going to use the first few matching tuples (LIMIT), and I
fear the memory management gets complicated. But in any case, the GiST
issue would still be there.Since we're discussing it, I'm attaching the prototype patch I have for
returning tuples from b-tree and using them to filter rows before heap
fetches. I was going to post it in the morning along with description
about the planner and executor changes, but here goes. It applies on top
of the indexam API patch I started this thread with.
I'm not sure where we are on this patch for reviewing purposes.
Heikki, are you planning to provide an updated patch? Or what should
we be doing here from an RRR standpoint?
...Robert
[ thinks a bit ... ] At least for GIST, it is possible that whether
data can be regurgitated will vary depending on the selected opclass.
Some opclasses use the STORAGE modifier and some don't. I am not sure
how hard we want to work to support flexibility there. Would it be
sufficient to hard-code the check as "pgam says the AM can do it,
and the opclass does not have a STORAGE property"? Or do we need
additional intelligence about GIST opclasses?
GiST: btree_gist uses STORAGE option, although original value is accessible from
index's tuple.
GIN doesn't require setting of STORAGE option even if it's impossible to
reconstruct original heap value from indexed value. Right now, only btree_gin's
opclasses could be used for index only scans (and only for single-column index
scan!).
So, STORAGE option could not indicate "reconstruct-ability" original heap value
:(. It seems to me, opclass definition could contain boolean field about that,
but with STORAGE option specified it's needed to have separate "reconstructing"
interface method. IMHO, it's too complex for now and it doesn't give significant
benefits.
Although index-only scan over GIN/GiST could be useful for some aggregates
queries like count(*).
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/