About "Our CLUSTER implementation is pessimal" patch
Hi,
I read the thread "Our CLUSTER implementation is pessimal" http://archives.postgresql.org/pgsql-hackers/2008-08/msg01371.php .
I would like to try/integrate that patch as we use CLUSTER a lot on our system.
I was going to try to add the proper cost_index/cost_sort calls to decide which "path" should be executed, as in:
http://archives.postgresql.org/pgsql-hackers/2008-09/msg00517.php
I don't think it will be easy without help... I'll ask here a lot I'm afraid...
About that patch:
1) would it be possible to use the tuplesort_*tupleslot set of functions instead of writing new ones for HeapTuple? That is: is it that difficult/impossible/nonsense to construct TupleTableSlot from HeapTuple and use those?
2) The patch doesn't check "HeapTupleSatisfiesVacuum" before passing it to tuplesort_putrawtuple: would it be reasonable to check the "isdead" flag before calling tuplesort_putrawtuple for each tuple?
Sorry if I talked nonsense...
Leonardo
Leonardo F wrote:
I read the thread "Our CLUSTER implementation is pessimal" http://archives.postgresql.org/pgsql-hackers/2008-08/msg01371.php .
I would like to try/integrate that patch as we use CLUSTER a lot on our system.
Great!
About that patch:
1) would it be possible to use the tuplesort_*tupleslot set of functions instead of writing new ones for HeapTuple? That is: is it that difficult/impossible/nonsense to construct TupleTableSlot from HeapTuple and use those?
Yeah, I think you could do that, I agree it feels better that way.
You'll still need new copytup and comparetup functions, though, to deal
with HeapTupleHeaders instead of MinimalTuples, or modify the existing
ones to handle both. And some way to indicate that you want to preserve
the visibility information when you create the tuplesort, maybe a new
parameter to tuplesort_begin_heap().
2) The patch doesn't check "HeapTupleSatisfiesVacuum" before passing it to tuplesort_putrawtuple: would it be reasonable to check the "isdead" flag before calling tuplesort_putrawtuple for each tuple?
Yeah, seems reasonable, to avoid sorting dead tuples unnecessarily.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Yeah, I think you could do that, I agree it feels better that way.
You'll still need new copytup and comparetup functions, though, to deal
with HeapTupleHeaders instead of MinimalTuples, or modify the existing
ones to handle both.
You meant HeapTuple, not HeapTupleHeaders, right?
Mmh, didn't think of those two functions; I might as well start with Gregory
Stark's patch (that is: using HeapTuple)
And some way to indicate that you want to preserve
the visibility information when you create the tuplesort, maybe a new
parameter to tuplesort_begin_heap().
I guess that using Gregory Stark's patch there's no need for it, since it uses
HeapTuples, right?
A patch that:
1) uses always the old CLUSTER method for non-btree indexes and for
expression indexes
2) add a whole set of new functions to tuplesort (as in Gregory Stark's patch)
would be rejected "for sure"? Or can be thought as a "better than nothing,
works in 90% cases" patch?
Leonardo
Leonardo F wrote:
Yeah, I think you could do that, I agree it feels better that way.
You'll still need new copytup and comparetup functions, though, to deal
with HeapTupleHeaders instead of MinimalTuples, or modify the existing
ones to handle both.You meant HeapTuple, not HeapTupleHeaders, right?
No, I did mean HeapTupleHeader. MinimalTuple struct is cut-down version
HeapTupleHeader, while HeapTuple is structure that holds a pointer to
HeapTupleHeader + some extra information. SortTuple takes the role of
HeapTUple in tuplesort.c. A bit confusing, yes.
That said, I didn't really look closely, maybe I'm missing something and
HeapTuple is in fact the right struct to pass around.
And some way to indicate that you want to preserve
the visibility information when you create the tuplesort, maybe a new
parameter to tuplesort_begin_heap().I guess that using Gregory Stark's patch there's no need for it, since it uses
HeapTuples, right?
Hmm, you still need to set different comparison function in
Tuplesortstate->comparetup, so you'll still need a separate begin()
function too, or a flag to the existing one.
A patch that:
1) uses always the old CLUSTER method for non-btree indexes and for
expression indexes
2) add a whole set of new functions to tuplesort (as in Gregory Stark's patch)would be rejected "for sure"? Or can be thought as a "better than nothing,
works in 90% cases" patch?
I'm fine with 1), though I wish we didn't have to add all that
boilerplate code 2). I guess it's not a show-stopper.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
I read the thread "Our CLUSTER implementation is pessimal"
http://archives.postgresql.org/pgsql-hackers/2008-08/msg01371.php .I would like to try/integrate that patch as we use CLUSTER a lot on our system.
I was going to try to add the proper cost_index/cost_sort calls to decide which
"path" should be executed, as in:http://archives.postgresql.org/pgsql-hackers/2008-09/msg00517.php
I think I got something up and running to check if a table scan + sort is supposed
to be faster than an index scan for a certain CLUSTER operation.
The way I did it is (I guess...) wrong: I created the elements needed by
get_relation_info, create_seqscan_path, create_index_path, cost_sort.
It has been, obviously, a trial and error approach: I added the member values as
soon as one function call crashed... and I bet I didn't get all the corner cases.
Is there any better way of doing it?
Leonardo
(this is called in copy_heap_data to decide which path to choose:)
static bool use_index_scan(Oid tableOid, Oid indexOid)
{
RelOptInfo *rel;
PlannerInfo *root;
Query *query;
PlannerGlobal *glob;
Path *seqAndSortPath;
IndexPath *indexPath;
RangeTblEntry *rte;
rel = makeNode(RelOptInfo);
rel->reloptkind = RELOPT_BASEREL;
rel->relid = 1;
rel->rtekind = RTE_RELATION;
/* needed by get_relation_info */
glob = makeNode(PlannerGlobal);
/* needed by get_relation_info: */
query = makeNode(Query);
query->resultRelation = 0;
root = makeNode(PlannerInfo);
root->parse = query;
root->glob = glob;
get_relation_info(root, tableOid, false, rel);
seqAndSortPath = create_seqscan_path(NULL, rel);
rel->rows = rel->tuples;
rte = makeNode(RangeTblEntry);
rte->rtekind = RTE_RELATION;
rte->relid = tableOid;
root->simple_rel_array_size = 2;
root->simple_rte_array = (RangeTblEntry **)
palloc0(root->simple_rel_array_size * sizeof(RangeTblEntry *));
root->simple_rte_array[1] = rte;
root->total_table_pages = rel->pages;
indexPath = create_index_path(root, (IndexOptInfo*)(list_head(rel->indexlist)->data.ptr_value), NULL, NULL, ForwardScanDirection, NULL);
cost_sort(seqAndSortPath, root, NULL, seqAndSortPath->total_cost, rel->tuples, rel->width, -1);
return indexPath->path.total_cost < seqAndSortPath->total_cost;
}
Anyone? I'd like some feedback before moving on to do the seq scan + sort in those
CLUSTER cases where "use_index_scan" returns false...
----- Messaggio originale -----
Show quoted text
Da: Leonardo F <m_lists@yahoo.it>
A: pgsql-hackers@postgresql.org
Inviato: Mer 20 gennaio 2010, 18:48:00
Oggetto: Re: [HACKERS] About "Our CLUSTER implementation is pessimal" patchI read the thread "Our CLUSTER implementation is pessimal"
http://archives.postgresql.org/pgsql-hackers/2008-08/msg01371.php .I was going to try to add the proper cost_index/cost_sort calls to decide
which
"path" should be executed, as in:
http://archives.postgresql.org/pgsql-hackers/2008-09/msg00517.php
I think I got something up and running to check if a table scan + sort is
supposed
to be faster than an index scan for a certain CLUSTER operation.The way I did it is (I guess...) wrong: I created the elements needed by
get_relation_info, create_seqscan_path, create_index_path, cost_sort.It has been, obviously, a trial and error approach: I added the member values as
soon as one function call crashed... and I bet I didn't get all the corner
cases.
Is there any better way of doing it?Leonardo
(this is called in copy_heap_data to decide which path to choose:)
static bool use_index_scan(Oid tableOid, Oid indexOid)
{
RelOptInfo *rel;
PlannerInfo *root;
Query *query;
PlannerGlobal *glob;
Path *seqAndSortPath;
IndexPath *indexPath;
RangeTblEntry *rte;rel = makeNode(RelOptInfo);
rel->reloptkind = RELOPT_BASEREL;
rel->relid = 1;
rel->rtekind = RTE_RELATION;/* needed by get_relation_info */
glob = makeNode(PlannerGlobal);/* needed by get_relation_info: */
query = makeNode(Query);
query->resultRelation = 0;root = makeNode(PlannerInfo);
root->parse = query;
root->glob = glob;get_relation_info(root, tableOid, false, rel);
seqAndSortPath = create_seqscan_path(NULL, rel);rel->rows = rel->tuples;
rte = makeNode(RangeTblEntry);
rte->rtekind = RTE_RELATION;
rte->relid = tableOid;root->simple_rel_array_size = 2;
root->simple_rte_array = (RangeTblEntry **)
palloc0(root->simple_rel_array_size * sizeof(RangeTblEntry *));
root->simple_rte_array[1] = rte;root->total_table_pages = rel->pages;
indexPath = create_index_path(root,
(IndexOptInfo*)(list_head(rel->indexlist)->data.ptr_value), NULL, NULL,
ForwardScanDirection, NULL);
cost_sort(seqAndSortPath, root, NULL, seqAndSortPath->total_cost, rel->tuples,
rel->width, -1);return indexPath->path.total_cost < seqAndSortPath->total_cost;
}
Leonardo F <m_lists@yahoo.it> wrote:
Anyone? I'd like some feedback before moving on to do the seq scan + sort in those
CLUSTER cases where "use_index_scan" returns false...
+1 for CLUSTER using sort.
I have a couple of comments for the current implementation:
* Do we need to disable sort-path for tables clustered on a gist index?
* I'd prefer to separate cost calculation routines from create_index_path()
and cost_sort(), rather than using a dummy planner.
Regards,
---
Takahiro Itagaki
NTT Open Source Software Center
one idea could be to actually prepare a query using SPI for "select * from
table order by <cols>" and then peek inside to see which plan was generated.
perhaps you could do this using the existing planner hook.
you might have to watch out for the user's rules or planner hooks (though I
don't think referential integrity triggers take any precautions about those
dangers)
greg
On 20 Jan 2010 17:48, "Leonardo F" <m_lists@yahoo.it> wrote:
I read the thread "Our CLUSTER implementation is pessimal" >
http://archives.postgresql.org/pgsql...
I think I got something up and running to check if a table scan + sort is
supposed
to be faster than an index scan for a certain CLUSTER operation.
The way I did it is (I guess...) wrong: I created the elements needed by
get_relation_info, create_seqscan_path, create_index_path, cost_sort.
It has been, obviously, a trial and error approach: I added the member
values as
soon as one function call crashed... and I bet I didn't get all the corner
cases.
Is there any better way of doing it?
Leonardo
(this is called in copy_heap_data to decide which path to choose:)
static bool use_index_scan(Oid tableOid, Oid indexOid)
{
RelOptInfo *rel;
PlannerInfo *root;
Query *query;
PlannerGlobal *glob;
Path *seqAndSortPath;
IndexPath *indexPath;
RangeTblEntry *rte;
rel = makeNode(RelOptInfo);
rel->reloptkind = RELOPT_BASEREL;
rel->relid = 1;
rel->rtekind = RTE_RELATION;
/* needed by get_relation_info */
glob = makeNode(PlannerGlobal);
/* needed by get_relation_info: */
query = makeNode(Query);
query->resultRelation = 0;
root = makeNode(PlannerInfo);
root->parse = query;
root->glob = glob;
get_relation_info(root, tableOid, false, rel);
seqAndSortPath = create_seqscan_path(NULL, rel);
rel->rows = rel->tuples;
rte = makeNode(RangeTblEntry);
rte->rtekind = RTE_RELATION;
rte->relid = tableOid;
root->simple_rel_array_size = 2;
root->simple_rte_array = (RangeTblEntry **)
palloc0(root->simple_rel_array_size * sizeof(RangeTblEntry *));
root->simple_rte_array[1] = rte;
root->total_table_pages = rel->pages;
indexPath = create_index_path(root,
(IndexOptInfo*)(list_head(rel->indexlist)->data.ptr_value), NULL, NULL,
ForwardScanDirection, NULL);
cost_sort(seqAndSortPath, root, NULL, seqAndSortPath->total_cost,
rel->tuples, rel->width, -1);
return indexPath->path.total_cost < seqAndSortPath->total_cost;
} -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To
make changes to you...
* Do we need to disable sort-path for tables clustered on a gist index?
Yes; as I said in a previous mail, only plain btree indexes (that is, not
custom expression indexes) would have that option (at least in a first
version...)
* I'd prefer to separate cost calculation routines from create_index_path()
and cost_sort(), rather than using a dummy planner.
How? Copy&paste (removing unnecessary code) of the existing
create_index_path() and cost_sort()?
Leonardo
Takahiro Itagaki <itagaki.takahiro@oss.ntt.co.jp> writes:
* I'd prefer to separate cost calculation routines from create_index_path()
and cost_sort(), rather than using a dummy planner.
Don't go that way. The cost functions have enough dependencies on
low-level planner functionality that making them be standalone would be
a serious mess, both immediately and in terms of future maintenance.
(As an example, someday we'll probably get around to having cost_sort
actually look at the specific columns being sorted by, and that's
going to involve a lot of interaction with pathkeys.)
What I do think is that the quoted code snippet has no business being
outside the planner proper. It'd be better to put it in planner.c
or someplace like that.
regards, tom lane
one idea could be to actually prepare a query using SPI for "select * from table order by <cols>" and then peek inside
to see which plan was generated.
I like that!!!
Here's a first attempt, it looks like it's working...
(I still have to skip non-btree indexes and expression indexes, plus
add a ASC/DESC to the select)
static bool use_index_scan(Relation OldHeap, Oid indexOid)
{
HeapTuple indexTuple;
Form_pg_index indexForm;
struct _SPI_plan *spiPlan;
char st[(NAMEDATALEN+1)*INDEX_MAX_KEYS+NAMEDATALEN+100];
int i;
TupleDesc oldTupDesc;
bool retval = true;
PlannedStmt *plan;
CachedPlanSource *cps;
oldTupDesc = RelationGetDescr(OldHeap);
sprintf(st, "select * from %s order by ", OldHeap->rd_rel->relname.data);
indexTuple = SearchSysCache(INDEXRELID, ObjectIdGetDatum(indexOid),0, 0, 0);
if (!HeapTupleIsValid(indexTuple))
elog(ERROR, "cache lookup failed for index %u", indexOid);
indexForm = (Form_pg_index) GETSTRUCT(indexTuple);
for (i = 0; i < indexForm->indnatts; i++)
{
strcat(st, oldTupDesc->attrs[indexForm->indkey.values[i]-1]->attname.data);
if (i+1 < indexForm->indnatts)
{
strcat(st, ",");
}
}
ReleaseSysCache(indexTuple);
if (SPI_connect() != SPI_OK_CONNECT)
return false;
spiPlan = SPI_prepare(st, 0, NULL);
if (spiPlan == NULL)
{
SPI_finish();
return false;
}
cps = (CachedPlanSource*)(list_head(spiPlan->plancache_list)->data.ptr_value);
plan = (PlannedStmt*)(list_head(cps->plan->stmt_list)->data.ptr_value);
if (IsA(plan->planTree, Sort))
{
retval = false;
}
SPI_freeplan(spiPlan);
SPI_finish();
return retval;
}
Leonardo F <m_lists@yahoo.it> writes:
one idea could be to actually prepare a query using SPI for "select * from table order by <cols>" and then peek inside
to see which plan was generated.
I like that!!!
Here's a first attempt, it looks like it's working...
(I still have to skip non-btree indexes and expression indexes, plus
add a ASC/DESC to the select)
By the time you make this actually work in all cases, it's probably
going to be more of a mess than the other way; not to mention that it
doesn't work *at all* without violating SPI internals.
regards, tom lane
By the time you make this actually work in all cases, it's probably
going to be more of a mess than the other way;
I meant to add only ASC/DESC; I would leave all other cases
(non-btrees, custom expression btrees) to use the old index-scan method.
not to mention that it
doesn't work *at all* without violating SPI internals.
You lost me there...
Leonardo F <m_lists@yahoo.it> writes:
By the time you make this actually work in all cases, it's probably
going to be more of a mess than the other way;
I meant to add only ASC/DESC; I would leave all other cases
(non-btrees, custom expression btrees) to use the old index-scan method.
That hardly seems acceptable.
not to mention that it
doesn't work *at all* without violating SPI internals.
You lost me there...
You're poking into a data structure you shouldn't be poking into:
/* Plans are opaque structs for standard users of SPI */
typedef struct _SPI_plan *SPIPlanPtr;
I hardly think that keeping yourself at arm's length from the planner
by getting cozy with SPI internals is a net improvement in modularity.
regards, tom lane
I meant to add only ASC/DESC; I would leave all other cases
(non-btrees, custom expression btrees) to use the old index-scan method.That hardly seems acceptable.
Well I brought up that in an earlier post:
http://old.nabble.com/Re%3A-About-%22Our-CLUSTER-implementation-is-pessimal%22-patch-p27179107.html
I hoped that since people mostly (>95%?) use plain btree indexes,
a patch that helped CLUSTER with using such indexes would be fine
(at least at first...). I guess that a patch that deals with all other types of
indexes would be way more complicated (not at the "planning" stage,
but in the scan+sort phase)?
I hardly think that keeping yourself at arm's length from the planner
by getting cozy with SPI internals is a net improvement in modularity.
So you think that code snippet that I sent earlier (the function that uses
create_index_path etc) could be put in planner.c (almost) as it is? It looked
clumsy to me (I liked the SPI code better)
Leonardo
Leonardo F <m_lists@yahoo.it> writes:
I hoped that since people mostly (>95%?) use plain btree indexes,
a patch that helped CLUSTER with using such indexes would be fine
(at least at first...). I guess that a patch that deals with all other types of
indexes would be way more complicated (not at the "planning" stage,
but in the scan+sort phase)?
Well, the expression cases would be more likely to cost more if
implemented as a sort, but that doesn't mean that a sort couldn't be a
win. Besides, even if you blow off the expression case, what about
nulls first/last, nondefault opclasses, etc?
regards, tom lane
Well, the expression cases would be more likely to cost more if
implemented as a sort, but that doesn't mean that a sort couldn't be a
win. Besides, even if you blow off the expression case, what about
nulls first/last, nondefault opclasses, etc?
Ok, let's split the problem in 2 parts:
a) "choosing the right path"
b) "using seq scan + sort in case the planner (somehow) said it's faster"
You're right: for a) nondefault opclasses and other things would make the
SPI version "wrong". So let's stick for the moment with the cost_sort
create_index_path etc calls for a). I think that the calls to create_index_path
would also deal with every other possible index type (something that's
impossible with the SPI version).
For b):
I'm using the code in
http://archives.postgresql.org/pgsql-hackers/2008-08/msg01371.php
That doesn't deal with expression indexes, nor with anything that is not
a btree index. But it's much faster on what people use the most (and not
slower on anything else).
So my proposal would be: do the test seq_scan vs sort/index_scan only for
regular btree index, and integrate that test in the planner.
That doesn't mean that sometime in the future we're not going to have full
support for all index types in seq scan + sort CLUSTER... but I would like
to have something that works faster on what people use, and not slower
in the other cases without waiting ages to have the "whole" thing...
On Thu, Jan 21, 2010 at 1:28 PM, Leonardo F <m_lists@yahoo.it> wrote:
Well, the expression cases would be more likely to cost more if
implemented as a sort, but that doesn't mean that a sort couldn't be a
win. Besides, even if you blow off the expression case, what about
nulls first/last, nondefault opclasses, etc?Ok, let's split the problem in 2 parts:
a) "choosing the right path"
b) "using seq scan + sort in case the planner (somehow) said it's faster"You're right: for a) nondefault opclasses and other things would make the
SPI version "wrong". So let's stick for the moment with the cost_sort
create_index_path etc calls for a). I think that the calls to create_index_path
would also deal with every other possible index type (something that's
impossible with the SPI version).For b):
I'm using the code in
http://archives.postgresql.org/pgsql-hackers/2008-08/msg01371.php
That doesn't deal with expression indexes, nor with anything that is not
a btree index. But it's much faster on what people use the most (and not
slower on anything else).So my proposal would be: do the test seq_scan vs sort/index_scan only for
regular btree index, and integrate that test in the planner.That doesn't mean that sometime in the future we're not going to have full
support for all index types in seq scan + sort CLUSTER... but I would like
to have something that works faster on what people use, and not slower
in the other cases without waiting ages to have the "whole" thing...
Keep in mind that this patch was after the deadline for 9.0, so there
is probably not a huge rush to get this done.
...Robert
Tom Lane <tgl@sss.pgh.pa.us> wrote:
What I do think is that the quoted code snippet has no business being
outside the planner proper. It'd be better to put it in planner.c
or someplace like that.
Ah, I see. My concern was the dummy planner approach is using internal
functions of planner. It would be better if planner module exports
a cost estimate function for cluster.
Regards,
---
Takahiro Itagaki
NTT Open Source Software Center
So my proposal would be: do the test seq_scan vs sort/index_scan only for
regular btree index, and integrate that test in the planner.Keep in mind that this patch was after the deadline for 9.0, so there
is probably not a huge rush to get this done.
That's true; I'll try to get the whole thing done then...