About "Our CLUSTER implementation is pessimal" patch
Hi,
I read the thread "Our CLUSTER implementation is pessimal" http://archives.postgresql.org/pgsql-hackers/2008-08/msg01371.php .
I would like to try/integrate that patch as we use CLUSTER a lot on our system.
I was going to try to add the proper cost_index/cost_sort calls to decide which "path" should be executed, as in:
http://archives.postgresql.org/pgsql-hackers/2008-09/msg00517.php
I don't think it will be easy without help... I'll ask here a lot I'm afraid...
About that patch:
1) would it be possible to use the tuplesort_*tupleslot set of functions instead of writing new ones for HeapTuple? That is: is it that difficult/impossible/nonsense to construct TupleTableSlot from HeapTuple and use those?
2) The patch doesn't check "HeapTupleSatisfiesVacuum" before passing it to tuplesort_putrawtuple: would it be reasonable to check the "isdead" flag before calling tuplesort_putrawtuple for each tuple?
Sorry if I talked nonsense...
Leonardo
Leonardo F wrote:
I read the thread "Our CLUSTER implementation is pessimal" http://archives.postgresql.org/pgsql-hackers/2008-08/msg01371.php .
I would like to try/integrate that patch as we use CLUSTER a lot on our system.
Great!
About that patch:
1) would it be possible to use the tuplesort_*tupleslot set of functions instead of writing new ones for HeapTuple? That is: is it that difficult/impossible/nonsense to construct TupleTableSlot from HeapTuple and use those?
Yeah, I think you could do that, I agree it feels better that way.
You'll still need new copytup and comparetup functions, though, to deal
with HeapTupleHeaders instead of MinimalTuples, or modify the existing
ones to handle both. And some way to indicate that you want to preserve
the visibility information when you create the tuplesort, maybe a new
parameter to tuplesort_begin_heap().
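Just to illustrate (untested, and the flag name is made up), the existing
entry point could grow one extra argument:

    Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
                                         int nkeys, AttrNumber *attNums,
                                         Oid *sortOperators, bool *nullsFirstFlags,
                                         int workMem, bool randomAccess,
                                         bool preserveVisibility); /* new flag */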
2) The patch doesn't check "HeapTupleSatisfiesVacuum" before passing it to tuplesort_putrawtuple: would it be reasonable to check the "isdead" flag before calling tuplesort_putrawtuple for each tuple?
Yeah, seems reasonable, to avoid sorting dead tuples unnecessarily.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Yeah, I think you could do that, I agree it feels better that way.
You'll still need new copytup and comparetup functions, though, to deal
with HeapTupleHeaders instead of MinimalTuples, or modify the existing
ones to handle both.
You meant HeapTuple, not HeapTupleHeaders, right?
Mmh, didn't think of those two functions; I might as well start with Gregory
Stark's patch (that is: using HeapTuple)
And some way to indicate that you want to preserve
the visibility information when you create the tuplesort, maybe a new
parameter to tuplesort_begin_heap().
I guess that using Gregory Stark's patch there's no need for it, since it uses
HeapTuples, right?
A patch that:
1) always uses the old CLUSTER method for non-btree indexes and for
expression indexes
2) adds a whole set of new functions to tuplesort (as in Gregory Stark's patch)
would be rejected "for sure"? Or could it be considered a "better than nothing,
works in 90% of cases" patch?
Leonardo
Leonardo F wrote:
Yeah, I think you could do that, I agree it feels better that way.
You'll still need new copytup and comparetup functions, though, to deal
with HeapTupleHeaders instead of MinimalTuples, or modify the existing
ones to handle both.
You meant HeapTuple, not HeapTupleHeaders, right?
No, I did mean HeapTupleHeader. The MinimalTuple struct is a cut-down version
of HeapTupleHeader, while HeapTuple is a structure that holds a pointer to a
HeapTupleHeader plus some extra information. SortTuple takes the role of
HeapTuple in tuplesort.c. A bit confusing, yes.
That said, I didn't really look closely, maybe I'm missing something and
HeapTuple is in fact the right struct to pass around.
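For reference, the relationship between the structs is roughly this
(from access/htup.h, abridged):

    typedef struct HeapTupleData
    {
        uint32          t_len;       /* length of *t_data */
        ItemPointerData t_self;      /* the tuple's TID */
        Oid             t_tableOid;  /* table the tuple came from */
        HeapTupleHeader t_data;      /* -> tuple header and data */
    } HeapTupleData;

    typedef HeapTupleData *HeapTuple;

MinimalTuple is essentially a HeapTupleHeader with the system (visibility)
columns removed, which is exactly why it can't carry what CLUSTER needs.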
And some way to indicate that you want to preserve
the visibility information when you create the tuplesort, maybe a new
parameter to tuplesort_begin_heap().
I guess that using Gregory Stark's patch there's no need for it, since it uses
HeapTuples, right?
Hmm, you still need to set different comparison function in
Tuplesortstate->comparetup, so you'll still need a separate begin()
function too, or a flag to the existing one.
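To be clear, the begin() function is where the per-case function pointers get
wired up, so a sketch of what the new one would do (function names as in
Gregory Stark's patch):

    state->comparetup = comparetup_rawheap;
    state->copytup = copytup_rawheap;
    state->writetup = writetup_rawheap;
    state->readtup = readtup_rawheap;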
A patch that:
1) always uses the old CLUSTER method for non-btree indexes and for
expression indexes
2) adds a whole set of new functions to tuplesort (as in Gregory Stark's patch)
would be rejected "for sure"? Or could it be considered a "better than nothing,
works in 90% of cases" patch?
I'm fine with 1), though I wish we didn't have to add all that
boilerplate code in 2). I guess it's not a show-stopper.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
I read the thread "Our CLUSTER implementation is pessimal"
http://archives.postgresql.org/pgsql-hackers/2008-08/msg01371.php .
I would like to try/integrate that patch as we use CLUSTER a lot on our system.
I was going to try to add the proper cost_index/cost_sort calls to decide which
"path" should be executed, as in:
http://archives.postgresql.org/pgsql-hackers/2008-09/msg00517.php
I think I got something up and running to check whether a table scan + sort is
expected to be faster than an index scan for a certain CLUSTER operation.
The way I did it is (I guess...) wrong: I created the elements needed by
get_relation_info, create_seqscan_path, create_index_path, cost_sort.
It was, obviously, a trial-and-error approach: I added the member values as
soon as one function call crashed... and I bet I didn't get all the corner cases.
Is there any better way of doing it?
Leonardo
(this is called in copy_heap_data to decide which path to choose:)
static bool
use_index_scan(Oid tableOid, Oid indexOid)
{
    RelOptInfo     *rel;
    PlannerInfo    *root;
    Query          *query;
    PlannerGlobal  *glob;
    Path           *seqAndSortPath;
    IndexPath      *indexPath;
    RangeTblEntry  *rte;

    /* build a dummy RelOptInfo for the table being clustered */
    rel = makeNode(RelOptInfo);
    rel->reloptkind = RELOPT_BASEREL;
    rel->relid = 1;
    rel->rtekind = RTE_RELATION;

    /* needed by get_relation_info */
    glob = makeNode(PlannerGlobal);

    /* needed by get_relation_info: */
    query = makeNode(Query);
    query->resultRelation = 0;

    root = makeNode(PlannerInfo);
    root->parse = query;
    root->glob = glob;

    get_relation_info(root, tableOid, false, rel);
    seqAndSortPath = create_seqscan_path(NULL, rel);

    rel->rows = rel->tuples;

    rte = makeNode(RangeTblEntry);
    rte->rtekind = RTE_RELATION;
    rte->relid = tableOid;

    root->simple_rel_array_size = 2;
    root->simple_rte_array = (RangeTblEntry **)
        palloc0(root->simple_rel_array_size * sizeof(RangeTblEntry *));
    root->simple_rte_array[1] = rte;

    root->total_table_pages = rel->pages;

    indexPath = create_index_path(root,
                                  (IndexOptInfo *) linitial(rel->indexlist),
                                  NULL, NULL, ForwardScanDirection, NULL);
    cost_sort(seqAndSortPath, root, NULL, seqAndSortPath->total_cost,
              rel->tuples, rel->width, -1);

    return indexPath->path.total_cost < seqAndSortPath->total_cost;
}
Anyone? I'd like some feedback before moving on to do the seq scan + sort in those
CLUSTER cases where "use_index_scan" returns false...
Leonardo F <m_lists@yahoo.it> wrote:
Anyone? I'd like some feedback before moving on to do the seq scan + sort in those
CLUSTER cases where "use_index_scan" returns false...
+1 for CLUSTER using sort.
I have a couple of comments for the current implementation:
* Do we need to disable sort-path for tables clustered on a gist index?
* I'd prefer to separate cost calculation routines from create_index_path()
and cost_sort(), rather than using a dummy planner.
Regards,
---
Takahiro Itagaki
NTT Open Source Software Center
one idea could be to actually prepare a query using SPI for "select * from
table order by <cols>" and then peek inside to see which plan was generated.
perhaps you could do this using the existing planner hook.
you might have to watch out for the user's rules or planner hooks (though I
don't think referential integrity triggers take any precautions about those
dangers)
greg
* Do we need to disable sort-path for tables clustered on a gist index?
Yes; as I said in a previous mail, only plain btree indexes (that is, not
custom expression indexes) would have that option (at least in a first
version...)
* I'd prefer to separate cost calculation routines from create_index_path()
and cost_sort(), rather than using a dummy planner.
How? Copy&paste (removing unnecessary code) of the existing
create_index_path() and cost_sort()?
Leonardo
Takahiro Itagaki <itagaki.takahiro@oss.ntt.co.jp> writes:
* I'd prefer to separate cost calculation routines from create_index_path()
and cost_sort(), rather than using a dummy planner.
Don't go that way. The cost functions have enough dependencies on
low-level planner functionality that making them be standalone would be
a serious mess, both immediately and in terms of future maintenance.
(As an example, someday we'll probably get around to having cost_sort
actually look at the specific columns being sorted by, and that's
going to involve a lot of interaction with pathkeys.)
What I do think is that the quoted code snippet has no business being
outside the planner proper. It'd be better to put it in planner.c
or someplace like that.
regards, tom lane
one idea could be to actually prepare a query using SPI for "select * from table order by <cols>" and then peek inside
to see which plan was generated.
I like that!!!
Here's a first attempt, it looks like it's working...
(I still have to skip non-btree indexes and expression indexes, plus
add ASC/DESC to the select)
static bool
use_index_scan(Relation OldHeap, Oid indexOid)
{
    HeapTuple         indexTuple;
    Form_pg_index     indexForm;
    struct _SPI_plan *spiPlan;
    char              st[(NAMEDATALEN + 1) * INDEX_MAX_KEYS + NAMEDATALEN + 100];
    int               i;
    TupleDesc         oldTupDesc;
    bool              retval = true;
    PlannedStmt      *plan;
    CachedPlanSource *cps;

    oldTupDesc = RelationGetDescr(OldHeap);
    sprintf(st, "select * from %s order by ", OldHeap->rd_rel->relname.data);

    /* append the index columns to the ORDER BY list */
    indexTuple = SearchSysCache(INDEXRELID, ObjectIdGetDatum(indexOid), 0, 0, 0);
    if (!HeapTupleIsValid(indexTuple))
        elog(ERROR, "cache lookup failed for index %u", indexOid);
    indexForm = (Form_pg_index) GETSTRUCT(indexTuple);
    for (i = 0; i < indexForm->indnatts; i++)
    {
        strcat(st, oldTupDesc->attrs[indexForm->indkey.values[i] - 1]->attname.data);
        if (i + 1 < indexForm->indnatts)
            strcat(st, ",");
    }
    ReleaseSysCache(indexTuple);

    if (SPI_connect() != SPI_OK_CONNECT)
        return false;

    spiPlan = SPI_prepare(st, 0, NULL);
    if (spiPlan == NULL)
    {
        SPI_finish();
        return false;
    }

    /* peek inside the (opaque) plan: is the top node a Sort? */
    cps = (CachedPlanSource *) linitial(spiPlan->plancache_list);
    plan = (PlannedStmt *) linitial(cps->plan->stmt_list);
    if (IsA(plan->planTree, Sort))
        retval = false;

    SPI_freeplan(spiPlan);
    SPI_finish();
    return retval;
}
Leonardo F <m_lists@yahoo.it> writes:
one idea could be to actually prepare a query using SPI for "select * from table order by <cols>" and then peek inside
to see which plan was generated.
I like that!!!
Here's a first attempt, it looks like it's working...
(I still have to skip non-btree indexes and expression indexes, plus
add ASC/DESC to the select)
By the time you make this actually work in all cases, it's probably
going to be more of a mess than the other way; not to mention that it
doesn't work *at all* without violating SPI internals.
regards, tom lane
By the time you make this actually work in all cases, it's probably
going to be more of a mess than the other way;
I meant to add only ASC/DESC; I would leave all other cases
(non-btrees, custom expression btrees) to use the old index-scan method.
not to mention that it
doesn't work *at all* without violating SPI internals.
You lost me there...
Leonardo F <m_lists@yahoo.it> writes:
By the time you make this actually work in all cases, it's probably
going to be more of a mess than the other way;
I meant to add only ASC/DESC; I would leave all other cases
(non-btrees, custom expression btrees) to use the old index-scan method.
That hardly seems acceptable.
not to mention that it
doesn't work *at all* without violating SPI internals.
You lost me there...
You're poking into a data structure you shouldn't be poking into:
/* Plans are opaque structs for standard users of SPI */
typedef struct _SPI_plan *SPIPlanPtr;
I hardly think that keeping yourself at arm's length from the planner
by getting cozy with SPI internals is a net improvement in modularity.
regards, tom lane
I meant to add only ASC/DESC; I would leave all other cases
(non-btrees, custom expression btrees) to use the old index-scan method.
That hardly seems acceptable.
Well, I brought that up in an earlier post:
http://old.nabble.com/Re%3A-About-%22Our-CLUSTER-implementation-is-pessimal%22-patch-p27179107.html
I hoped that since people mostly (>95%?) use plain btree indexes,
a patch that helped CLUSTER with using such indexes would be fine
(at least at first...). I guess that a patch that deals with all other types of
indexes would be way more complicated (not at the "planning" stage,
but in the scan+sort phase)?
I hardly think that keeping yourself at arm's length from the planner
by getting cozy with SPI internals is a net improvement in modularity.
So you think that code snippet that I sent earlier (the function that uses
create_index_path etc) could be put in planner.c (almost) as it is? It looked
clumsy to me (I liked the SPI code better)
Leonardo
Leonardo F <m_lists@yahoo.it> writes:
I hoped that since people mostly (>95%?) use plain btree indexes,
a patch that helped CLUSTER with using such indexes would be fine
(at least at first...). I guess that a patch that deals with all other types of
indexes would be way more complicated (not at the "planning" stage,
but in the scan+sort phase)?
Well, the expression cases would be more likely to cost more if
implemented as a sort, but that doesn't mean that a sort couldn't be a
win. Besides, even if you blow off the expression case, what about
nulls first/last, nondefault opclasses, etc?
regards, tom lane
Well, the expression cases would be more likely to cost more if
implemented as a sort, but that doesn't mean that a sort couldn't be a
win. Besides, even if you blow off the expression case, what about
nulls first/last, nondefault opclasses, etc?
Ok, let's split the problem into 2 parts:
a) "choosing the right path"
b) "using seq scan + sort in case the planner (somehow) said it's faster"
You're right: for a) nondefault opclasses and other things would make the
SPI version "wrong". So let's stick for the moment with the cost_sort
create_index_path etc calls for a). I think that the calls to create_index_path
would also deal with every other possible index type (something that's
impossible with the SPI version).
For b):
I'm using the code in
http://archives.postgresql.org/pgsql-hackers/2008-08/msg01371.php
That doesn't deal with expression indexes, nor with anything that is not
a btree index. But it's much faster on what people use the most (and not
slower on anything else).
So my proposal would be: do the test seq_scan vs sort/index_scan only for
regular btree index, and integrate that test in the planner.
That doesn't mean that sometime in the future we're not going to have full
support for all index types in seq scan + sort CLUSTER... but I would like
to have something that works faster on what people use, and not slower
in the other cases without waiting ages to have the "whole" thing...
On Thu, Jan 21, 2010 at 1:28 PM, Leonardo F <m_lists@yahoo.it> wrote:
So my proposal would be: do the test seq_scan vs sort/index_scan only for
regular btree index, and integrate that test in the planner.
Keep in mind that this patch was after the deadline for 9.0, so there
is probably not a huge rush to get this done.
...Robert
Tom Lane <tgl@sss.pgh.pa.us> wrote:
What I do think is that the quoted code snippet has no business being
outside the planner proper. It'd be better to put it in planner.c
or someplace like that.
Ah, I see. My concern was that the dummy planner approach uses internal
functions of the planner. It would be better if the planner module exported
a cost estimate function for CLUSTER.
Regards,
---
Takahiro Itagaki
NTT Open Source Software Center
So my proposal would be: do the test seq_scan vs sort/index_scan only for
regular btree index, and integrate that test in the planner.
Keep in mind that this patch was after the deadline for 9.0, so there
is probably not a huge rush to get this done.
That's true; I'll try to get the whole thing done then...
On Thu, Jan 21, 2010 at 4:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
You're poking into a data structure you shouldn't be poking into:
/* Plans are opaque structs for standard users of SPI */
typedef struct _SPI_plan *SPIPlanPtr;
I hardly think that keeping yourself at arm's length from the planner
by getting cozy with SPI internals is a net improvement in modularity.
You could do it without poking into the SPI structure. You could
define a planner_hook which calls the regular planner and then just
use SPI to drive the parser etc and invoke your planner hook. The
planner hook could inspect the plan, even modify it.
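Untested sketch of what I mean (planner_hook and standard_planner are the
real hook points from optimizer/planner.h; the static capture variable is
made up):

    static PlannedStmt *captured_stmt = NULL;
    static planner_hook_type prev_planner_hook = NULL;

    static PlannedStmt *
    capture_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
    {
        PlannedStmt *stmt;

        /* run the regular planner (or whoever hooked in before us) */
        if (prev_planner_hook)
            stmt = prev_planner_hook(parse, cursorOptions, boundParams);
        else
            stmt = standard_planner(parse, cursorOptions, boundParams);

        /* stash it so the caller can check e.g. IsA(stmt->planTree, Sort) */
        captured_stmt = stmt;
        return stmt;
    }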
One of the plans I considered early on, but skipped because I thought it
might get too complex, was to add special syntax like
"SELECT __heaptuple__ from table ORDER BY ..." and then just run it
through the regular executor (and do some unspecified hackery around
the visibility rules).
Part of the reason I'm interested in using the regular planner as much
as possible is that if you have a partial index it could be a huge win
to use some unexpected different index to scan the part of the table
you need.
--
greg
So, if I'm not mistaken:
hash indexes -> can't be used in CLUSTER
gin indexes -> can't be used in CLUSTER
that leaves:
btree -> ok
expression btree -> I have to find a way to compute the expression for
each tuple: hints?
gist -> how can I get something "comparable" by tuplesort? Or should I rule it
out from the seq scan + sort path?
Leonardo
Leonardo F <m_lists@yahoo.it> writes:
gist -> how can I get something "comparable" by tuplesort? Or should I rule it
out from the seq scan + sort path?
Rule it out. Note you should be looking at pg_am.amcanorder, not
hardwiring knowledge of particular index types.
regards, tom lane
Note you should be looking at pg_am.amcanorder, not
hardwiring knowledge of particular index types.
Ok.
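(To check I understood, a minimal sketch of that test, assuming the index's
Relation has its pg_am row cached in rd_am:)

    /* rule out the seq scan + sort path unless the AM does ordered scans */
    if (!OldIndex->rd_am->amcanorder)
        use_sort = false;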
Would it make sense to use
FormIndexDatum
to get the index value to be used by tuplesort?
I'm having trouble avoiding the call to _bt_mkscankey_nodata
to get the scanKeys... that relates to btree only, I can't find
the "general" function...
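To be concrete, for the key values I was thinking of something along these
lines (untested sketch; estate and indexInfo would be set up once, elsewhere):

    Datum           values[INDEX_MAX_KEYS];
    bool            isnull[INDEX_MAX_KEYS];
    TupleTableSlot *slot = GetPerTupleExprContext(estate)->ecxt_scantuple;

    /* evaluate the index columns (plain or expression) for this tuple */
    ExecStoreTuple(tuple, slot, InvalidBuffer, false);
    FormIndexDatum(indexInfo, slot, estate, values, isnull);
    /* values[0] / isnull[0] hold the leading sort key */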
Leonardo F <m_lists@yahoo.it> writes:
Would it make sense to use
FormIndexDatum
to get the index value to be used by tuplesort?
That would probably work. You might want to look at the code associated
with the recently-added exclusion constraint feature; that also has a
need to reproduce index entries sometimes.
regards, tom lane
Rule it out. Note you should be looking at pg_am.amcanorder, not
hardwiring knowledge of particular index types.
Sorry, I replied "ok" too fast...
I can look at pg_am.amcanorder, but I would still need the ScanKey to be used
by tuplesort; and I can't find any other way of doing it than calling
_bt_mkscankey_nodata, which is btree-specific.
I guess either:
- add another function to the list of "Index Access Method Functions", something
that returns the ScanKey in case pg_am.amcanorder is true
or
- hardwiring the fact that the only way to seq scan + sort in CLUSTER is using
a btree... hence the call to _bt_mkscankey_nodata
But maybe there's another way of doing it; I don't know the code well enough.
Leonardo
Leonardo F <m_lists@yahoo.it> writes:
Rule it out. Note you should be looking at pg_am.amcanorder, not
hardwiring knowledge of particular index types.
I can look at pg_am.amcanorder, but I would still need the ScanKey to be used
by tuplesort; and I can't find any other way of doing it than calling
_bt_mkscankey_nodata, which is btree-specific.
Well, actually, it's not *quite* as btree specific as all that. Note
the fine print in section 50.3:
: Some access methods return index entries in a well-defined order,
: others do not. If entries are returned in sorted order, the access
: method should set pg_am.amcanorder true to indicate that it supports
: ordered scans. All such access methods must use btree-compatible
: strategy numbers for their equality and ordering operators.
So in principle you could probably do something that avoided any
"official" dependency on btree. Whether it's worth doing in practice is
pretty dubious though. I agree that calling a routine that claims to be
btree-specific would be best done only after making a specific test for
BTREE_AM_OID. If we ever get another index type that supports ordered
scans, it'll be time enough to worry about cases like this.
BTW, I think you could use tuplesort_begin_index_btree() rather than
touching _bt_mkscankey_nodata directly. Doesn't affect the fundamental
problem, but you might as well use the most convenient API.
regards, tom lane
If we ever get another index type that supports ordered
scans, it'll be time enough to worry about cases like this.
Ok
BTW, I think you could use tuplesort_begin_index_btree() rather than
touching _bt_mkscankey_nodata directly.
Well, I created my own tuplesort_begin_rawheap method (copied from:
http://archives.postgresql.org/pgsql-hackers/2008-08/msg01371.php ).
From there I call _bt_mkscankey_nodata (as tuplesort_begin_index_btree()
does), plus I set up everything else I'm going to need in tuplesort.
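(For reference, the signature I'm mimicking is, if I read tuplesort.h right:

    Tuplesortstate *tuplesort_begin_index_btree(Relation indexRel,
                                                bool enforceUnique,
                                                int workMem, bool randomAccess);

mine just takes the heap's TupleDesc as an extra argument.)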
Another question:
why is IndexInfo.ii_Expressions a list? How can an index have more than
one expression? Sorry if it's a stupid question, but I'm not familiar with
index expressions.
I think I'm almost there (some very stupid tests pass). I'll try to submit a
patch soon to understand if I'm going in the right direction.
Leonardo
Leonardo F wrote:
why is IndexInfo.ii_Expressions a list? How can an index have more than
one expression? Sorry if it's a stupid question, but I'm not familiar with
index expressions.
Consider multi-column indexes, e.g.:
CREATE INDEX i_foo ON foo (length(a), length(b));
Maybe you're confusing expression indexes with partial indexes? The
predicate for a partial index is stored in ii_Predicate, not
ii_Expressions. Although ii_Predicate is a list too; in that case it's a
list of clauses that are implicitly ANDed together.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Consider multi-column indexes, ie:
CREATE INDEX i_foo ON foo (length(a), length(b));
Ok, I've never thought of expression indexes that way
(in the (expr1,expr2,exprN) form): that is a good example.
Maybe you're confusing expression indexes with partial indexes?
No no, that was exactly what I needed to know.
Thank you very much
Leonardo
Hi all,
attached a patch to do seq scan + sorting instead of index scan
on CLUSTER (when that's supposed to be faster).
As I've already said, the patch is based on:
http://archives.postgresql.org/pgsql-hackers/2008-08/msg01371.php
Of course, the code isn't supposed to be ready to be merged: I
would like to write more comments and add some test cases to
cluster.sql (plus change all the things you are going to tell me I
have to change...)
I would like your opinions on code correctness and the decisions
I took, especially:
1) function names ("cost_index_scan_vs_seqscansort" I guess
it's awful...)
2) the fact that I put an EState variable in Tuplesortstate, so that
MakeSingleTupleTableSlot wouldn't have to be called for every
row in the expression indexes case
3) the expression index case is not "optimized": I preferred to
call FormIndexDatum once for the first key value in
copytup_rawheap and another time to get all the remaining values
in comparetup_rawheap. I liked the idea of re-using
FormIndexDatum in that case, instead of copying&pasting only
the relevant code: but FormIndexDatum returns all the values,
even when I might need only the first one
4) I refactored the code to deform and rewrite tuple into the function
"deform_and_rewrite_tuple", because now that part can be called
by the regular index scan or by the new seq-scan + sort (but I
could copy&paste those lines instead of refactoring them into a new
function)
Suggestions and comments are not just welcome, but needed!
Leonardo
Attachment: sorted_cluster.patch (application/octet-stream)
Index: src/backend/optimizer/path/costsize.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/optimizer/path/costsize.c,v
retrieving revision 1.214
diff -c -r1.214 costsize.c
*** src/backend/optimizer/path/costsize.c 5 Jan 2010 21:53:58 -0000 1.214
--- src/backend/optimizer/path/costsize.c 28 Jan 2010 09:25:54 -0000
***************
*** 78,83 ****
--- 78,84 ----
#include "optimizer/placeholder.h"
#include "optimizer/planmain.h"
#include "optimizer/restrictinfo.h"
+ #include "optimizer/plancat.h"
#include "parser/parsetree.h"
#include "utils/lsyscache.h"
#include "utils/selfuncs.h"
***************
*** 2623,2628 ****
--- 2624,2700 ----
(void *) context);
}
+ /*
+ * cost_index_scan_vs_seqscansort
+ * Estimate the CPU costs of evaluating an index scan and
+ * a sequence scan + sort. This is needed by CLUSTER to
+ * decide the fastest path (index scan vs seq scan + sort).
+ *
+ */
+ void
+ cost_index_scan_vs_seqscansort(Oid tableOid, Oid indexOid,
+ Cost *indexScanTotCost,
+ Cost *seqAndSortCost)
+ {
+ RelOptInfo *rel;
+ PlannerInfo *root;
+ Query *query;
+ PlannerGlobal *glob;
+ RangeTblEntry *rte;
+ ListCell *index;
+ IndexPath *indexScanPath;
+ Path *seqAndSortPath;
+
+ rel = makeNode(RelOptInfo);
+ rel->reloptkind = RELOPT_BASEREL;
+ rel->relid = 1;
+ rel->rtekind = RTE_RELATION;
+
+ /* needed by get_relation_info */
+ glob = makeNode(PlannerGlobal);
+
+ /* needed by get_relation_info: */
+ query = makeNode(Query);
+ query->resultRelation = 0;
+
+ root = makeNode(PlannerInfo);
+
+ root->parse = query;
+ root->glob = glob;
+
+ get_relation_info(root, tableOid, false, rel);
+ seqAndSortPath = create_seqscan_path(NULL, rel);
+
+ rel->rows = rel->tuples;
+
+ rte = makeNode(RangeTblEntry);
+ rte->rtekind = RTE_RELATION;
+ rte->relid = tableOid;
+
+ root->simple_rel_array_size = 2;
+ root->simple_rte_array = (RangeTblEntry **)
+ palloc0(root->simple_rel_array_size * sizeof(RangeTblEntry *));
+ root->simple_rte_array[1] = rte;
+
+ root->total_table_pages = rel->pages;
+ foreach(index, rel->indexlist)
+ {
+ IndexOptInfo *indexInfo = (IndexOptInfo*)(index->data.ptr_value);
+ if (indexInfo->indexoid == indexOid)
+ {
+ indexScanPath = create_index_path(root, indexInfo, NULL, NULL, ForwardScanDirection, NULL);
+ break;
+ }
+ }
+
+ Assert(indexScanPath != NULL);
+
+ cost_sort(seqAndSortPath, root, NULL, seqAndSortPath->total_cost, rel->tuples, rel->width, -1);
+
+ *seqAndSortCost = seqAndSortPath->total_cost;
+ *indexScanTotCost = indexScanPath->path.total_cost;
+ }
+
/*
* adjust_semi_join
Index: src/backend/utils/sort/tuplesort.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/sort/tuplesort.c,v
retrieving revision 1.94
diff -c -r1.94 tuplesort.c
*** src/backend/utils/sort/tuplesort.c 2 Jan 2010 16:57:58 -0000 1.94
--- src/backend/utils/sort/tuplesort.c 28 Jan 2010 09:25:55 -0000
***************
*** 104,109 ****
--- 104,110 ----
#include "access/nbtree.h"
#include "catalog/pg_amop.h"
#include "catalog/pg_operator.h"
+ #include "catalog/index.h"
#include "commands/tablespace.h"
#include "miscadmin.h"
#include "pg_trace.h"
***************
*** 115,126 ****
--- 116,129 ----
#include "utils/rel.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+ #include "executor/executor.h"
/* sort-type codes for sort__start probes */
#define HEAP_SORT 0
#define INDEX_SORT 1
#define DATUM_SORT 2
+ #define RAWHEAP_SORT 3
/* GUC variables */
#ifdef TRACE_SORT
***************
*** 366,371 ****
--- 369,378 ----
int datumTypeLen;
bool datumTypeByVal;
+ /* These are specific to the rawheap subcase: */
+ EState *estate;
+ IndexInfo *indexInfo;
+
/*
* Resource snapshot for time of sort start.
*/
***************
*** 450,455 ****
--- 457,468 ----
static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void reversedirection_heap(Tuplesortstate *state);
+ static int comparetup_rawheap(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state);
+ static void copytup_rawheap(Tuplesortstate *state, SortTuple *stup, void *tup);
+ static void writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup);
+ static void readtup_rawheap(Tuplesortstate *state, SortTuple *stup, int tapenum,
+ unsigned int len);
static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
***************
*** 549,554 ****
--- 562,570 ----
state->result_tape = -1; /* flag that result tape has not been formed */
+ /* set estate to NULL, so we don't try to free it later if not used */
+ state->estate = NULL;
+
MemoryContextSwitchTo(oldcontext);
return state;
***************
*** 762,767 ****
--- 778,844 ----
return state;
}
+ Tuplesortstate *
+ tuplesort_begin_rawheap(Relation indexRel,
+ TupleDesc tupDesc,
+ int workMem, bool randomAccess)
+ {
+ Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+ MemoryContext oldcontext;
+
+ TupleTableSlot *existing_slot;
+ ExprContext *econtext;
+
+ Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
+
+ oldcontext = MemoryContextSwitchTo(state->sortcontext);
+
+ #ifdef TRACE_SORT
+ if (trace_sort)
+ elog(LOG,
+ "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c",
+ RelationGetNumberOfAttributes(indexRel), workMem, randomAccess ? 't' : 'f');
+ #endif
+
+ state->nKeys = RelationGetNumberOfAttributes(indexRel);
+
+ TRACE_POSTGRESQL_SORT_START(RAWHEAP_SORT, false, state->nKeys, workMem, randomAccess);
+
+ state->comparetup = comparetup_rawheap;
+ state->copytup = copytup_rawheap;
+ state->writetup = writetup_rawheap;
+ state->readtup = readtup_rawheap;
+ state->reversedirection = reversedirection_heap;
+
+ state->indexInfo = BuildIndexInfo(indexRel);
+ state->indexScanKey = _bt_mkscankey_nodata(indexRel);
+ state->enforceUnique = false;
+
+ state->tupDesc = tupDesc; /* assume we need not copy tupDesc */
+
+ if (state->indexInfo->ii_Expressions != NULL)
+ {
+ /* allocate the vars used by FormIndexDatum */
+ state->estate = CreateExecutorState();
+
+ /*
+ * Need a TupleTableSlot to put existing tuples in.
+ *
+ * To use FormIndexDatum, we have to make the econtext's scantuple point
+ * to this slot. Be sure to save and restore caller's value for
+ * scantuple.
+ */
+ existing_slot = MakeSingleTupleTableSlot(tupDesc);
+
+ econtext = GetPerTupleExprContext(state->estate);
+ econtext->ecxt_scantuple = existing_slot;
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return state;
+ }
+
/*
* tuplesort_set_bound
*
***************
*** 849,854 ****
--- 926,934 ----
*/
TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, 0L);
#endif
+ if (state->estate != NULL)
+ ExecDropSingleTupleTableSlot(GetPerTupleExprContext(state->estate)->ecxt_scantuple);
+
MemoryContextSwitchTo(oldcontext);
***************
*** 980,985 ****
--- 1060,1087 ----
}
/*
+ * Accept one tuple while collecting input data for sort.
+ *
+ * Note that the input data is always copied; the caller need not save it.
+ */
+ void
+ tuplesort_putrawtuple(Tuplesortstate *state, HeapTuple tup)
+ {
+ MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ SortTuple stup;
+
+ /*
+ * Copy the given tuple into memory we control, and decrease availMem.
+ * Then call the common code.
+ */
+ COPYTUP(state, &stup, (void *) tup);
+
+ puttuple_common(state, &stup);
+
+ MemoryContextSwitchTo(oldcontext);
+ }
+
+ /*
* Shared code for tuple and datum cases.
*/
static void
***************
*** 1482,1487 ****
--- 1584,1609 ----
}
/*
+ * Fetch the next tuple in either forward or back direction.
+ * Returns NULL if no more tuples. If *should_free is set, the
+ * caller must pfree the returned tuple when done with it.
+ */
+ HeapTuple
+ tuplesort_getrawtuple(Tuplesortstate *state, bool forward,
+ bool *should_free)
+ {
+ MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ SortTuple stup;
+
+ if (!tuplesort_gettuple_common(state, forward, &stup, should_free))
+ stup.tuple = NULL;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return stup.tuple;
+ }
+
+ /*
* tuplesort_merge_order - report merge order we'll use for given memory
* (note: "merge order" just means the number of input tapes in the merge).
*
***************
*** 3079,3084 ****
--- 3201,3408 ----
}
/*
+ * Routines specialized for the raw on-disk HeapTuple case. These are only used
+ * when we need the visibility info for things like CLUSTER. Otherwise we use
+ * the regular *tup_heap methods, which actually manipulate MinimalTuples.
+ */
+ static int
+ comparetup_rawheap(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
+ {
+ ScanKey scanKey = state->indexScanKey;
+ HeapTuple ltup;
+ HeapTuple rtup;
+ TupleDesc tupDesc;
+ int nkey;
+ int32 compare;
+
+ /* Allow interrupting long sorts */
+ CHECK_FOR_INTERRUPTS();
+
+ /* Compare the leading sort key */
+ compare = inlineApplySortFunction(&scanKey->sk_func, scanKey->sk_flags,
+ a->datum1, a->isnull1,
+ b->datum1, b->isnull1);
+ if (compare != 0)
+ return compare;
+
+ /* Compare additional sort keys */
+ ltup = (HeapTuple) a->tuple;
+ rtup = (HeapTuple) b->tuple;
+
+ if (state->indexInfo->ii_Expressions == NULL)
+ {
+ /* if not expression index, just get the proper heap_getattr */
+
+ tupDesc = state->tupDesc;
+ scanKey++;
+
+ for (nkey = 1; nkey < state->nKeys; nkey++, scanKey++)
+ {
+ Datum datum1,
+ datum2;
+ bool isnull1,
+ isnull2;
+
+ datum1 = heap_getattr(ltup, state->indexInfo->ii_KeyAttrNumbers[nkey], tupDesc, &isnull1);
+ datum2 = heap_getattr(rtup, state->indexInfo->ii_KeyAttrNumbers[nkey], tupDesc, &isnull2);
+
+ compare = inlineApplySortFunction(&scanKey->sk_func, scanKey->sk_flags,
+ datum1, isnull1,
+ datum2, isnull2);
+ if (compare != 0)
+ return compare;
+ }
+ }
+ else
+ {
+ /* in the expression index case, we get all the values/nulls:
+ * it would be faster to get only the required ones, but it would mean
+ * copy&paste from FormIndexDatum
+ */
+
+ Datum l_existing_values[INDEX_MAX_KEYS];
+ bool l_existing_isnull[INDEX_MAX_KEYS];
+ Datum r_existing_values[INDEX_MAX_KEYS];
+ bool r_existing_isnull[INDEX_MAX_KEYS];
+ TupleTableSlot *ecxt_scantuple;
+
+ ecxt_scantuple = GetPerTupleExprContext(state->estate)->ecxt_scantuple;
+
+ ExecStoreTuple(ltup, ecxt_scantuple, InvalidBuffer, false);
+ FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
+ l_existing_values, l_existing_isnull);
+
+ ExecStoreTuple(rtup, ecxt_scantuple, InvalidBuffer, false);
+ FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
+ r_existing_values, r_existing_isnull);
+
+ for (nkey = 1; nkey < state->nKeys; nkey++, scanKey++)
+ {
+ compare = inlineApplySortFunction(&scanKey->sk_func,
+ scanKey->sk_flags,
+ l_existing_values[nkey],
+ l_existing_isnull[nkey],
+ r_existing_values[nkey],
+ r_existing_isnull[nkey]);
+
+ if (compare != 0)
+ return compare;
+
+ }
+ }
+
+
+
+ return 0;
+ }
+
+ static void
+ copytup_rawheap(Tuplesortstate *state, SortTuple *stup, void *tup)
+ {
+ HeapTuple tuple = (HeapTuple) tup;
+
+ /* copy the tuple into sort storage */
+ stup->tuple = (void *) heap_copytuple(tuple);
+ USEMEM(state, GetMemoryChunkSpace(stup->tuple));
+ /* set up first-column key value */
+ if (state->indexInfo->ii_Expressions == NULL)
+ {
+ /* no expression index, just get the key datum value */
+ stup->datum1 = heap_getattr((HeapTuple) stup->tuple,
+ state->indexInfo->ii_KeyAttrNumbers[0],
+ state->tupDesc,
+ &stup->isnull1);
+ }
+ else
+ {
+ /*
+ * Extract the index column values and isnull flags from the existing
+ * tuple; we're interested only in the very first one, but to avoid
+ * copy&paste from FormIndexDatum we get all of them (even if it's
+ * slower)
+ */
+
+ Datum existing_values[INDEX_MAX_KEYS];
+ bool existing_isnull[INDEX_MAX_KEYS];
+ TupleTableSlot *ecxt_scantuple;
+
+ ecxt_scantuple = GetPerTupleExprContext(state->estate)->ecxt_scantuple;
+ ExecStoreTuple(tuple, ecxt_scantuple, InvalidBuffer, false);
+ FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
+ existing_values, existing_isnull);
+
+ stup->datum1 = existing_values[0];
+ stup->isnull1 = existing_isnull[0];
+ }
+ }
+
+ static void
+ writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup)
+ {
+ HeapTuple tuple = (HeapTuple) stup->tuple;
+ tuple->t_len += HEAPTUPLESIZE; /* write out the header as well */
+
+ LogicalTapeWrite(state->tapeset, tapenum,
+ tuple, tuple->t_len);
+
+ if (state->randomAccess) /* need trailing length word? */
+ LogicalTapeWrite(state->tapeset, tapenum,
+ tuple, sizeof(tuple->t_len));
+
+ FREEMEM(state, GetMemoryChunkSpace(tuple));
+ heap_freetuple(tuple);
+ }
+
+ static void
+ readtup_rawheap(Tuplesortstate *state, SortTuple *stup,
+ int tapenum, unsigned int tuplen)
+ {
+ HeapTuple tuple = (HeapTuple) palloc(tuplen);
+
+ USEMEM(state, GetMemoryChunkSpace(tuple));
+
+ tuple->t_len = tuplen - HEAPTUPLESIZE;
+ if (LogicalTapeRead(state->tapeset, tapenum, &tuple->t_self, tuplen-sizeof(tuplen)) != tuplen-sizeof(tuplen))
+ elog(ERROR, "unexpected end of data");
+ if (state->randomAccess) /* need trailing length word? */
+ if (LogicalTapeRead(state->tapeset, tapenum, &tuplen,
+ sizeof(tuplen)) != sizeof(tuplen))
+ elog(ERROR, "unexpected end of data");
+
+ stup->tuple = tuple;
+ /* set up first-column key value */
+ if (state->indexInfo->ii_Expressions == NULL)
+ {
+ /* no expression index, just get the key datum value */
+ stup->datum1 = heap_getattr((HeapTuple) stup->tuple,
+ state->indexInfo->ii_KeyAttrNumbers[0],
+ state->tupDesc,
+ &stup->isnull1);
+ }
+ else
+ {
+ /*
+ * Extract the index column values and isnull flags from the existing
+ * tuple; we're interested only in the very first one, but to avoid
+ * copy&paste from FormIndexDatum we get all of them (even if it's
+ * slower)
+ */
+
+ Datum existing_values[INDEX_MAX_KEYS];
+ bool existing_isnull[INDEX_MAX_KEYS];
+ TupleTableSlot *ecxt_scantuple;
+
+ ecxt_scantuple = GetPerTupleExprContext(state->estate)->ecxt_scantuple;
+ ExecStoreTuple(tuple, ecxt_scantuple, InvalidBuffer, false);
+ FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
+ existing_values, existing_isnull);
+
+ stup->datum1 = existing_values[0];
+ stup->isnull1 = existing_isnull[0];
+ }
+ }
+
+ /*
* Convenience routine to free a tuple previously loaded into sort memory
*/
static void
Index: src/include/optimizer/cost.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/optimizer/cost.h,v
retrieving revision 1.100
diff -c -r1.100 cost.h
*** src/include/optimizer/cost.h 2 Jan 2010 16:58:07 -0000 1.100
--- src/include/optimizer/cost.h 28 Jan 2010 09:25:55 -0000
***************
*** 109,114 ****
--- 109,118 ----
extern void cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan);
extern void cost_qual_eval(QualCost *cost, List *quals, PlannerInfo *root);
extern void cost_qual_eval_node(QualCost *cost, Node *qual, PlannerInfo *root);
+ extern void cost_index_scan_vs_seqscansort(Oid tableOid, Oid indexOid,
+ Cost *indexScanTotCost,
+ Cost *seqAndSortCost);
+
extern void set_baserel_size_estimates(PlannerInfo *root, RelOptInfo *rel);
extern void set_joinrel_size_estimates(PlannerInfo *root, RelOptInfo *rel,
RelOptInfo *outer_rel,
Index: src/include/utils/tuplesort.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/utils/tuplesort.h,v
retrieving revision 1.35
diff -c -r1.35 tuplesort.h
*** src/include/utils/tuplesort.h 2 Jan 2010 16:58:10 -0000 1.35
--- src/include/utils/tuplesort.h 28 Jan 2010 09:25:55 -0000
***************
*** 64,69 ****
--- 64,72 ----
extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
Oid sortOperator, bool nullsFirstFlag,
int workMem, bool randomAccess);
+ extern Tuplesortstate *tuplesort_begin_rawheap(Relation indexRel,
+ TupleDesc tupDesc,
+ int workMem, bool randomAccess);
extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
***************
*** 72,77 ****
--- 75,81 ----
extern void tuplesort_putindextuple(Tuplesortstate *state, IndexTuple tuple);
extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
bool isNull);
+ extern void tuplesort_putrawtuple(Tuplesortstate *state, HeapTuple tup);
extern void tuplesort_performsort(Tuplesortstate *state);
***************
*** 81,86 ****
--- 85,92 ----
bool *should_free);
extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward,
Datum *val, bool *isNull);
+ extern HeapTuple tuplesort_getrawtuple(Tuplesortstate *state, bool forward,
+ bool *should_free);
extern void tuplesort_end(Tuplesortstate *state);
Index: src/backend/commands/cluster.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/commands/cluster.c,v
retrieving revision 1.194
diff -c -r1.194 cluster.c
*** src/backend/commands/cluster.c 20 Jan 2010 19:43:40 -0000 1.194
--- src/backend/commands/cluster.c 28 Jan 2010 09:25:53 -0000
***************
*** 47,53 ****
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
!
/*
* This struct is used to pass around the information on tables to be
--- 47,57 ----
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
! #include "utils/tuplesort.h"
! #include "optimizer/plancat.h"
! #include "optimizer/pathnode.h"
! #include "optimizer/cost.h"
! #include "executor/spi_priv.h"
/*
* This struct is used to pass around the information on tables to be
***************
*** 66,73 ****
static TransactionId copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap,
Oid OIDOldIndex, int freeze_min_age, int freeze_table_age);
static List *get_tables_to_cluster(MemoryContext cluster_context);
!
!
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
--- 70,78 ----
static TransactionId copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap,
Oid OIDOldIndex, int freeze_min_age, int freeze_table_age);
static List *get_tables_to_cluster(MemoryContext cluster_context);
! static void deform_and_rewrite_tuple(HeapTuple tuple, TupleDesc oldTupDesc, TupleDesc newTupDesc,
! Datum *values, bool *isnull,
! bool newRelHasOids, RewriteState rwstate);
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
***************
*** 791,796 ****
--- 796,803 ----
TransactionId OldestXmin;
TransactionId FreezeXid;
RewriteState rwstate;
+ bool use_sort = false;
+ Tuplesortstate *tuplesort;
/*
* Open the relations we need.
***************
*** 860,866 ****
* copied, we scan with SnapshotAny and use HeapTupleSatisfiesVacuum for
* the visibility test.
*/
! if (OldIndex != NULL)
{
heapScan = NULL;
indexScan = index_beginscan(OldHeap, OldIndex,
--- 867,883 ----
* copied, we scan with SnapshotAny and use HeapTupleSatisfiesVacuum for
* the visibility test.
*/
! if (OldIndex != NULL && (OldIndex->rd_rel->relam == BTREE_AM_OID))
! {
! Cost indexScanTotCost,
! seqAndSortCost;
! cost_index_scan_vs_seqscansort(OIDOldHeap, OIDOldIndex,
! &indexScanTotCost,
! &seqAndSortCost);
! use_sort = seqAndSortCost < indexScanTotCost;
! }
!
! if (OldIndex != NULL && !use_sort)
{
heapScan = NULL;
indexScan = index_beginscan(OldHeap, OldIndex,
***************
*** 869,888 ****
else
{
heapScan = heap_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
indexScan = NULL;
}
for (;;)
{
HeapTuple tuple;
- HeapTuple copiedTuple;
Buffer buf;
! bool isdead;
! int i;
CHECK_FOR_INTERRUPTS();
! if (OldIndex != NULL)
{
tuple = index_getnext(indexScan, ForwardScanDirection);
if (tuple == NULL)
--- 886,909 ----
else
{
heapScan = heap_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
+ if (use_sort)
+ {
+ tuplesort = tuplesort_begin_rawheap(OldIndex, oldTupDesc,
+ maintenance_work_mem, false);
+ }
indexScan = NULL;
}
for (;;)
{
+
HeapTuple tuple;
Buffer buf;
! bool isdead;
CHECK_FOR_INTERRUPTS();
! if (OldIndex != NULL && !use_sort)
{
tuple = index_getnext(indexScan, ForwardScanDirection);
if (tuple == NULL)
***************
*** 902,908 ****
buf = heapScan->rs_cbuf;
}
-
LockBuffer(buf, BUFFER_LOCK_SHARE);
switch (HeapTupleSatisfiesVacuum(tuple->t_data, OldestXmin, buf))
--- 923,928 ----
***************
*** 956,1000 ****
continue;
}
! /*
! * We cannot simply copy the tuple as-is, for several reasons:
! *
! * 1. We'd like to squeeze out the values of any dropped columns, both
! * to save space and to ensure we have no corner-case failures. (It's
! * possible for example that the new table hasn't got a TOAST table
! * and so is unable to store any large values of dropped cols.)
! *
! * 2. The tuple might not even be legal for the new table; this is
! * currently only known to happen as an after-effect of ALTER TABLE
! * SET WITHOUT OIDS.
! *
! * So, we must reconstruct the tuple from component Datums.
! */
! heap_deform_tuple(tuple, oldTupDesc, values, isnull);
!
! /* Be sure to null out any dropped columns */
! for (i = 0; i < natts; i++)
{
! if (newTupDesc->attrs[i]->attisdropped)
! isnull[i] = true;
}
- copiedTuple = heap_form_tuple(newTupDesc, values, isnull);
! /* Preserve OID, if any */
! if (NewHeap->rd_rel->relhasoids)
! HeapTupleSetOid(copiedTuple, HeapTupleGetOid(tuple));
! /* The heap rewrite module does the rest */
! rewrite_heap_tuple(rwstate, tuple, copiedTuple);
! heap_freetuple(copiedTuple);
}
! if (OldIndex != NULL)
index_endscan(indexScan);
else
heap_endscan(heapScan);
/* Write out any remaining tuples, and fsync if needed */
end_heap_rewrite(rwstate);
--- 976,1029 ----
continue;
}
! if (!use_sort)
{
! deform_and_rewrite_tuple(tuple, oldTupDesc, newTupDesc, values, isnull,
! NewHeap->rd_rel->relhasoids, rwstate);
! }
! else
! {
! /* pass tuple to tuple store */
! tuplesort_putrawtuple(tuplesort, tuple);
}
! }
! if (use_sort)
! {
! tuplesort_performsort(tuplesort);
!
! /* read from tuplestore */
! for (;;)
! {
! HeapTuple tuple;
! bool shouldfree;
! CHECK_FOR_INTERRUPTS();
!
! tuple = tuplesort_getrawtuple(tuplesort, true, &shouldfree);
! if (tuple == NULL)
! break;
!
! deform_and_rewrite_tuple(tuple, oldTupDesc, newTupDesc, values, isnull,
! NewHeap->rd_rel->relhasoids, rwstate);
! if (shouldfree)
! heap_freetuple(tuple);
! }
}
!
! if (OldIndex != NULL && !use_sort)
! {
index_endscan(indexScan);
+ }
else
+ {
heap_endscan(heapScan);
+ if (use_sort)
+ tuplesort_end(tuplesort);
+ }
/* Write out any remaining tuples, and fsync if needed */
end_heap_rewrite(rwstate);
***************
*** 1237,1239 ****
--- 1266,1312 ----
return rvs;
}
+
+ static void deform_and_rewrite_tuple(HeapTuple tuple, TupleDesc oldTupDesc, TupleDesc newTupDesc,
+ Datum *values, bool *isnull,
+ bool newRelHasOids, RewriteState rwstate)
+ {
+ HeapTuple copiedTuple;
+ int i;
+
+ /*
+ * We cannot simply copy the tuple as-is, for several reasons:
+ *
+ * 1. We'd like to squeeze out the values of any dropped columns, both
+ * to save space and to ensure we have no corner-case failures. (It's
+ * possible for example that the new table hasn't got a TOAST table
+ * and so is unable to store any large values of dropped cols.)
+ *
+ * 2. The tuple might not even be legal for the new table; this is
+ * currently only known to happen as an after-effect of ALTER TABLE
+ * SET WITHOUT OIDS.
+ *
+ * So, we must reconstruct the tuple from component Datums.
+ */
+
+ heap_deform_tuple(tuple, oldTupDesc, values, isnull);
+
+ /* Be sure to null out any dropped columns */
+ for (i = 0; i < newTupDesc->natts; i++)
+ {
+ if (newTupDesc->attrs[i]->attisdropped)
+ isnull[i] = true;
+ }
+
+ copiedTuple = heap_form_tuple(newTupDesc, values, isnull);
+
+ /* Preserve OID, if any */
+ if (newRelHasOids)
+ HeapTupleSetOid(copiedTuple, HeapTupleGetOid(tuple));
+
+ /* The heap rewrite module does the rest */
+ rewrite_heap_tuple(rwstate, tuple, copiedTuple);
+
+ heap_freetuple(copiedTuple);
+ }
+
I know you're all very busy getting 9.0 out, but I think the results of
heap scan + sort instead of index scan for CLUSTER are
very good... I would like to know if I did something wrong / if I should
improve something in the patch... I haven't tested it with index
expressions yet (but the tests in "make check" pass).
Thanks
Leonardo
Not even a comment? As I said, performance results on my system
were very good....
On Tue, Feb 9, 2010 at 5:49 AM, Leonardo F <m_lists@yahoo.it> wrote:
Not even a comment? As I said, performance results on my system
were very good....
Hi Leonardo,
Perhaps you could supply a .sql file containing a testcase illustrating the
performance benefits you tested with your patch -- as I understand, the sole
purpose of this patch is to speed up CLUSTER -- along with your results? The
existing src/test/regress/sql/cluster.sql looks like it only has some basic
sanity checks in it, and not performance tests. An idea of what hardware
you're testing on and any tweaks to postgresql.conf might be useful as well.
I was able to apply your patch cleanly and run CLUSTER with it, and "make
check" passed for me on OS X as well.
Hope this helps, and sorry I'm not able to offer more technical advice on
your patch.
Josh
Hi all,
while testing the seq scan + sort CLUSTER implementation, I've found a
bug in write/readtup_rawheap.
The functions are needed by the sort part.
The code in
http://archives.postgresql.org/pgsql-hackers/2008-08/msg01371.php
didn't copy the whole tuple, but only the HeapTuple "header": the t_data
part wasn't copied (at least, that's my understanding). The original code
is at the bottom of this email. It didn't work (the table wasn't fully clustered
at the end of the CLUSTER command).
So I modified write/readtup_rawheap: they are supposed to store/retrieve
both the "fixed" part of HeapTupleData and the "variable" part, t_data.
But I get a lot of:
WARNING: problem in alloc set TupleSort: detected write past chunk end
in block 0x96853f0, chunk 0x968723c
WARNING: problem in alloc set TupleSort: detected write past chunk end
in block 0x96853f0, chunk 0x96872c8
warnings when calling "tuplesort_end" and some of the data gets garbled
after the CLUSTER command.
What am I doing wrong? Using the debugger, data coming out of
readtup_rawheap seems fine...
I *really* need your help here...
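For clarity, this is the on-tape layout both versions below are aiming
for (my reading of the code, not something the original patch documents):

/*
 * Intended tape layout for one raw tuple:
 *
 *   uint32 len               leading length word; it is the tuple's own
 *                            t_len field after the "+=" adjustment, i.e.
 *                            fixed HeapTupleData part + t_data part
 *   rest of HeapTupleData    t_self, t_tableOid and the (now stale)
 *                            t_data pointer
 *   t_data part              the HeapTupleHeader plus user data
 *   [uint32 len]             trailing length word, randomAccess only
 */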
static void
writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup)
{
HeapTuple tuple = (HeapTuple) stup->tuple;
tuple->t_len += sizeof(HeapTupleData); /* write out the header as well */
LogicalTapeWrite(state->tapeset, tapenum,
tuple, sizeof(HeapTupleData));
LogicalTapeWrite(state->tapeset, tapenum, tuple->t_data,
tuple->t_len-sizeof(HeapTupleData));
if (state->randomAccess) /* need trailing length word? */
LogicalTapeWrite(state->tapeset, tapenum,
tuple, sizeof(tuple->t_len));
FREEMEM(state, GetMemoryChunkSpace(tuple));
heap_freetuple(tuple);
}
static void
readtup_rawheap(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int tuplen)
{
HeapTuple tuple = (HeapTuple) palloc(sizeof(HeapTupleData));
tuple->t_data = (HeapTupleHeader) palloc(tuplen-sizeof(HeapTupleData));
USEMEM(state, GetMemoryChunkSpace(tuple));
tuple->t_len = tuplen - sizeof(HeapTupleData);
if (LogicalTapeRead(state->tapeset, tapenum, &tuple->t_self, sizeof(HeapTupleData)-sizeof(tuplen))
!= sizeof(HeapTupleData)-sizeof(tuplen))
elog(ERROR, "unexpected end of data");
if (LogicalTapeRead(state->tapeset, tapenum, tuple->t_data, tuple->t_len) != tuple->t_len)
elog(ERROR, "unexpected end of data");
if (state->randomAccess) /* need trailing length word? */
if (LogicalTapeRead(state->tapeset, tapenum, &tuplen,
sizeof(tuplen)) != sizeof(tuplen))
elog(ERROR, "unexpected end of data");
stup->tuple = tuple;
/* set up first-column key value */
if (state->indexInfo->ii_Expressions == NULL)
{
/* no expression index, just get the key datum value */
stup->datum1 = heap_getattr((HeapTuple) stup->tuple,
state->indexInfo->ii_KeyAttrNumbers[0],
state->tupDesc,
&stup->isnull1);
}
else
{
[...] expression index part, removed for clarity
}
}
Original code:
static void
writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup)
{
HeapTuple tuple = (HeapTuple) stup->tuple;
tuple->t_len += HEAPTUPLESIZE; /* write out the header as well */
LogicalTapeWrite(state->tapeset, tapenum,
tuple, tuple->t_len);
if (state->randomAccess) /* need trailing length word? */
LogicalTapeWrite(state->tapeset, tapenum,
tuple, sizeof(tuple->t_len));
FREEMEM(state, GetMemoryChunkSpace(tuple));
heap_freetuple(tuple);
}
static void
readtup_rawheap(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int tuplen)
{
HeapTuple tuple = (HeapTuple) palloc(tuplen);
USEMEM(state, GetMemoryChunkSpace(tuple));
tuple->t_len = tuplen - HEAPTUPLESIZE;
if (LogicalTapeRead(state->tapeset, tapenum, &tuple->t_self,
tuplen-sizeof(tuplen)) != tuplen-sizeof(tuplen))
elog(ERROR, "unexpected end of data");
if (state->randomAccess) /* need trailing length word? */
if (LogicalTapeRead(state->tapeset, tapenum, &tuplen,
sizeof(tuplen)) != sizeof(tuplen))
elog(ERROR, "unexpected end of data");
stup->tuple = tuple;
/* set up first-column key value */
stup->datum1 = heap_getattr(tuple,
state->scanKeys[0].sk_attno,
state->tupDesc,
&stup->isnull1);
}
Leonardo F wrote:
static void
writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup)
{
HeapTuple tuple = (HeapTuple) stup->tuple;
I think you're confusing HeapTuple and HeapTupleHeader. SortTuple->tuple
field should point to a HeapTupleHeader, not a HeapTuple.
The right mental model is that HeapTupleHeader is a physical tuple, one
that you store on disk for example. A HeapTuple is an in-memory wrapper,
or pointer if you will, to a HeapTupleHeader, and holds some additional
information on the tuple that's useful when operating on it.
To add to the confusion, MinimalTuple is a shortened counterpart of
HeapTuple*Header*, not HeapTuple. And SortTuple is an in-memory wrapper
similar to HeapTuple, containing additional information on the tuple
that helps with sorting.
I didn't look at the rest of the code in detail, but I think that's
where your problems are stemming from.
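To make the distinction concrete, here is a simplified sketch of the
structs involved (abridged from memory of src/include/access/htup.h;
check the real header for the authoritative definitions):

typedef struct HeapTupleData
{
    uint32          t_len;       /* length of *t_data */
    ItemPointerData t_self;      /* SelfItemPointer */
    Oid             t_tableOid;  /* table the tuple came from */
    HeapTupleHeader t_data;      /* -> tuple header and data */
} HeapTupleData;

typedef HeapTupleData *HeapTuple;

#define HEAPTUPLESIZE   MAXALIGN(sizeof(HeapTupleData))

/*
 * heap_copytuple() allocates the wrapper and the physical tuple in a
 * single palloc chunk, so for tuples built that way:
 *
 *   tuple->t_data == (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE)
 */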
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
I think you're confusing HeapTuple and HeapTupleHeader. SortTuple->tuple
field should point to a HeapTupleHeader, not a HeapTuple.
Mmh, I don't get it: that would mean I might be using more info than required,
but I still don't understand why it's not working... in the end, CLUSTER is going
to need full HeapTuples (if I'm not mistaken).
SortTuple->tuple can be any of MinimalTuple, IndexTuple, even a Datum.
So I added my own read/write_rawtuple functions (I'm not that good: they
were in the original patch and I copied & pasted them).
I can write and read those tuples (judging from some Asserts, everything goes
as expected in write/readtup_rawheap). But when calling tuplesort_getrawtuple,
after some correct tuples, the call to tuplesort_gettuple_common fails because
the lines:
/* pull next preread tuple from list, insert in heap */
newtup = &state->memtuples[tupIndex];
give a wrongly initialized "newtup" (in other words, newtup is random memory).
Since write/readtup_rawheap seems fine, I can't understand what's going on...
I write the HeapTuples with:
(in writetup_rawheap):
HeapTuple tuple = (HeapTuple) stup->tuple;
tuple->t_len += HEAPTUPLESIZE; /* write out the header as well */
LogicalTapeWrite(state->tapeset, tapenum,
tuple, HEAPTUPLESIZE);
LogicalTapeWrite(state->tapeset, tapenum, tuple->t_data,
tuple->t_len-HEAPTUPLESIZE);
then I read them with:
(in readtup_rawheap):
HeapTuple tuple = (HeapTuple) palloc(tuplen);
USEMEM(state, GetMemoryChunkSpace(tuple));
tuple->t_len = tuplen - HEAPTUPLESIZE;
if (LogicalTapeRead(state->tapeset, tapenum, &tuple->t_self,
HEAPTUPLESIZE-sizeof(tuplen)) != HEAPTUPLESIZE-sizeof(tuplen))
elog(ERROR, "unexpected end of data");
if (LogicalTapeRead(state->tapeset, tapenum, tuple->t_data, tuple->t_len) != tuple->t_len)
elog(ERROR, "unexpected end of data");
I think I've found the problem:
tuple->t_data wasn't at HEAPTUPLESIZE distance from tuple.
I guess the code makes that assumption somewhere, so I had
to do:
tuple->t_data = (HeapTupleHeader) ((char *) tuple +
HEAPTUPLESIZE);
Now that test works! Hope I don't find any more problems...
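For the archives, the read routine with that fix folded in looks roughly
like this (a sketch, assuming the single-chunk layout heap_copytuple
produces; the first-column key setup at the end is unchanged):

static void
readtup_rawheap(Tuplesortstate *state, SortTuple *stup,
                int tapenum, unsigned int tuplen)
{
    /* wrapper and tuple body live in one palloc chunk */
    HeapTuple tuple = (HeapTuple) palloc(tuplen);

    USEMEM(state, GetMemoryChunkSpace(tuple));
    tuple->t_len = tuplen - HEAPTUPLESIZE;

    /* read back the rest of HeapTupleData; note that this also reads
     * the stale t_data pointer that went out to the tape */
    if (LogicalTapeRead(state->tapeset, tapenum, &tuple->t_self,
                        HEAPTUPLESIZE - sizeof(tuplen)) !=
        HEAPTUPLESIZE - sizeof(tuplen))
        elog(ERROR, "unexpected end of data");

    /* repoint t_data into this chunk, overwriting the stale pointer */
    tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);

    /* now the tuple body itself */
    if (LogicalTapeRead(state->tapeset, tapenum, tuple->t_data,
                        tuple->t_len) != tuple->t_len)
        elog(ERROR, "unexpected end of data");

    if (state->randomAccess)    /* need trailing length word? */
        if (LogicalTapeRead(state->tapeset, tapenum, &tuplen,
                            sizeof(tuplen)) != sizeof(tuplen))
            elog(ERROR, "unexpected end of data");

    stup->tuple = tuple;
    /* ... set up stup->datum1/stup->isnull1 exactly as before ... */
}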
Leonardo
Perhaps you could supply a .sql file containing a test case
illustrating the performance benefits you tested with your patch.
Sure.
Attached are the updated patch (it should solve a bug) and a script.
The SQL script generates a 2M-row table ("orig"); then the
table is copied and the copy clustered using seq scan + sort
(forced with "set enable_indexscan=false;").
Then the table "orig" is copied again, and the copy clustered
using a regular index scan (set enable_indexscan=true; set
enable_seqscan=false).
Then the same thing is done on a 5M-row table, and on a 10M-row
table.
On my system (Sol10 on a dual Opteron 2.8), single disk:
2M:  seq+sort 11 secs; regular index scan: 33 secs
5M:  seq+sort 39 secs; regular index scan: 105 secs
10M: seq+sort 83 secs; regular index scan: 646 secs
Maybe someone could suggest a better/different test?
Leonardo
Attachments:
sorted_cluster20100210.patch (application/octet-stream)
Index: src/backend/optimizer/path/costsize.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/optimizer/path/costsize.c,v
retrieving revision 1.214
diff -c -r1.214 costsize.c
*** src/backend/optimizer/path/costsize.c 5 Jan 2010 21:53:58 -0000 1.214
--- src/backend/optimizer/path/costsize.c 10 Feb 2010 17:03:51 -0000
***************
*** 78,83 ****
--- 78,84 ----
#include "optimizer/placeholder.h"
#include "optimizer/planmain.h"
#include "optimizer/restrictinfo.h"
+ #include "optimizer/plancat.h"
#include "parser/parsetree.h"
#include "utils/lsyscache.h"
#include "utils/selfuncs.h"
***************
*** 2623,2628 ****
--- 2624,2700 ----
(void *) context);
}
+ /*
+ * cost_index_scan_vs_seqscansort
+ * Estimate the costs of evaluating an index scan and
+ * a sequential scan + sort. This is needed by CLUSTER to
+ * decide the fastest path (index scan vs seq scan + sort).
+ *
+ */
+ void
+ cost_index_scan_vs_seqscansort(Oid tableOid, Oid indexOid,
+ Cost *indexScanTotCost,
+ Cost *seqAndSortCost)
+ {
+ RelOptInfo *rel;
+ PlannerInfo *root;
+ Query *query;
+ PlannerGlobal *glob;
+ RangeTblEntry *rte;
+ ListCell *index;
+ IndexPath *indexScanPath;
+ Path *seqAndSortPath;
+
+ rel = makeNode(RelOptInfo);
+ rel->reloptkind = RELOPT_BASEREL;
+ rel->relid = 1;
+ rel->rtekind = RTE_RELATION;
+
+ /* needed by get_relation_info */
+ glob = makeNode(PlannerGlobal);
+
+ /* needed by get_relation_info: */
+ query = makeNode(Query);
+ query->resultRelation = 0;
+
+ root = makeNode(PlannerInfo);
+
+ root->parse = query;
+ root->glob = glob;
+
+ get_relation_info(root, tableOid, false, rel);
+ seqAndSortPath = create_seqscan_path(NULL, rel);
+
+ rel->rows = rel->tuples;
+
+ rte = makeNode(RangeTblEntry);
+ rte->rtekind = RTE_RELATION;
+ rte->relid = tableOid;
+
+ root->simple_rel_array_size = 2;
+ root->simple_rte_array = (RangeTblEntry **)
+ palloc0(root->simple_rel_array_size * sizeof(RangeTblEntry *));
+ root->simple_rte_array[1] = rte;
+
+ root->total_table_pages = rel->pages;
+ foreach(index, rel->indexlist)
+ {
+ IndexOptInfo *indexInfo = (IndexOptInfo*)(index->data.ptr_value);
+ if (indexInfo->indexoid == indexOid)
+ {
+ indexScanPath = create_index_path(root, indexInfo, NULL, NULL, ForwardScanDirection, NULL);
+ break;
+ }
+ }
+
+ Assert(indexScanPath != NULL);
+
+ cost_sort(seqAndSortPath, root, NULL, seqAndSortPath->total_cost, rel->tuples, rel->width, -1);
+
+ *seqAndSortCost = seqAndSortPath->total_cost;
+ *indexScanTotCost = indexScanPath->path.total_cost;
+ }
+
/*
* adjust_semi_join
Index: src/backend/utils/sort/tuplesort.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/sort/tuplesort.c,v
retrieving revision 1.94
diff -c -r1.94 tuplesort.c
*** src/backend/utils/sort/tuplesort.c 2 Jan 2010 16:57:58 -0000 1.94
--- src/backend/utils/sort/tuplesort.c 10 Feb 2010 17:03:52 -0000
***************
*** 104,109 ****
--- 104,110 ----
#include "access/nbtree.h"
#include "catalog/pg_amop.h"
#include "catalog/pg_operator.h"
+ #include "catalog/index.h"
#include "commands/tablespace.h"
#include "miscadmin.h"
#include "pg_trace.h"
***************
*** 115,126 ****
--- 116,129 ----
#include "utils/rel.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+ #include "executor/executor.h"
/* sort-type codes for sort__start probes */
#define HEAP_SORT 0
#define INDEX_SORT 1
#define DATUM_SORT 2
+ #define RAWHEAP_SORT 3
/* GUC variables */
#ifdef TRACE_SORT
***************
*** 366,371 ****
--- 369,378 ----
int datumTypeLen;
bool datumTypeByVal;
+ /* These are specific to the rawheap subcase: */
+ EState *estate;
+ IndexInfo *indexInfo;
+
/*
* Resource snapshot for time of sort start.
*/
***************
*** 450,455 ****
--- 457,468 ----
static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void reversedirection_heap(Tuplesortstate *state);
+ static int comparetup_rawheap(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state);
+ static void copytup_rawheap(Tuplesortstate *state, SortTuple *stup, void *tup);
+ static void writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup);
+ static void readtup_rawheap(Tuplesortstate *state, SortTuple *stup, int tapenum,
+ unsigned int len);
static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
***************
*** 549,554 ****
--- 562,570 ----
state->result_tape = -1; /* flag that result tape has not been formed */
+ /* set estate to NULL, so we don't try to free it later if not used */
+ state->estate = NULL;
+
MemoryContextSwitchTo(oldcontext);
return state;
***************
*** 762,767 ****
--- 778,844 ----
return state;
}
+ Tuplesortstate *
+ tuplesort_begin_rawheap(Relation indexRel,
+ TupleDesc tupDesc,
+ int workMem, bool randomAccess)
+ {
+ Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+ MemoryContext oldcontext;
+
+ TupleTableSlot *existing_slot;
+ ExprContext *econtext;
+
+ Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
+
+ oldcontext = MemoryContextSwitchTo(state->sortcontext);
+
+ #ifdef TRACE_SORT
+ if (trace_sort)
+ elog(LOG,
+ "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c",
+ RelationGetNumberOfAttributes(indexRel), workMem, randomAccess ? 't' : 'f');
+ #endif
+
+ state->nKeys = RelationGetNumberOfAttributes(indexRel);
+
+ TRACE_POSTGRESQL_SORT_START(RAWHEAP_SORT, false, state->nKeys, workMem, randomAccess);
+
+ state->comparetup = comparetup_rawheap;
+ state->copytup = copytup_rawheap;
+ state->writetup = writetup_rawheap;
+ state->readtup = readtup_rawheap;
+ state->reversedirection = reversedirection_heap;
+
+ state->indexInfo = BuildIndexInfo(indexRel);
+ state->indexScanKey = _bt_mkscankey_nodata(indexRel);
+ state->enforceUnique = false;
+
+ state->tupDesc = tupDesc; /* assume we need not copy tupDesc */
+
+ if (state->indexInfo->ii_Expressions != NULL)
+ {
+ /* allocate the vars used by FormIndexDatum */
+ state->estate = CreateExecutorState();
+
+ /*
+ * Need a TupleTableSlot to put existing tuples in.
+ *
+ * To use FormIndexDatum, we have to make the econtext's scantuple point
+ * to this slot. Be sure to save and restore caller's value for
+ * scantuple.
+ */
+ existing_slot = MakeSingleTupleTableSlot(tupDesc);
+
+ econtext = GetPerTupleExprContext(state->estate);
+ econtext->ecxt_scantuple = existing_slot;
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return state;
+ }
+
/*
* tuplesort_set_bound
*
***************
*** 849,854 ****
--- 926,934 ----
*/
TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, 0L);
#endif
+ if (state->estate != NULL)
+ ExecDropSingleTupleTableSlot(GetPerTupleExprContext(state->estate)->ecxt_scantuple);
+
MemoryContextSwitchTo(oldcontext);
***************
*** 980,985 ****
--- 1060,1087 ----
}
/*
+ * Accept one tuple while collecting input data for sort.
+ *
+ * Note that the input data is always copied; the caller need not save it.
+ */
+ void
+ tuplesort_putrawtuple(Tuplesortstate *state, HeapTuple tup)
+ {
+ MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ SortTuple stup;
+
+ /*
+ * Copy the given tuple into memory we control, and decrease availMem.
+ * Then call the common code.
+ */
+ COPYTUP(state, &stup, (void *) tup);
+
+ puttuple_common(state, &stup);
+
+ MemoryContextSwitchTo(oldcontext);
+ }
+
+ /*
* Shared code for tuple and datum cases.
*/
static void
***************
*** 1482,1487 ****
--- 1584,1609 ----
}
/*
+ * Fetch the next tuple in either forward or back direction.
+ * Returns NULL if no more tuples. If *should_free is set, the
+ * caller must pfree the returned tuple when done with it.
+ */
+ HeapTuple
+ tuplesort_getrawtuple(Tuplesortstate *state, bool forward,
+ bool *should_free)
+ {
+ MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ SortTuple stup;
+
+ if (!tuplesort_gettuple_common(state, forward, &stup, should_free))
+ stup.tuple = NULL;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return stup.tuple;
+ }
+
+ /*
* tuplesort_merge_order - report merge order we'll use for given memory
* (note: "merge order" just means the number of input tapes in the merge).
*
***************
*** 3079,3084 ****
--- 3201,3412 ----
}
/*
+ * Routines specialized for the raw on-disk HeapTuple case. These are only
+ * used when we need the visibility info for things like CLUSTER. Otherwise
+ * we use the regular *tup_heap methods, which actually manipulate
+ * MinimalTuples.
+ */
+ static int
+ comparetup_rawheap(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
+ {
+ ScanKey scanKey = state->indexScanKey;
+ HeapTuple ltup;
+ HeapTuple rtup;
+ TupleDesc tupDesc;
+ int nkey;
+ int32 compare;
+
+ /* Allow interrupting long sorts */
+ CHECK_FOR_INTERRUPTS();
+
+ /* Compare the leading sort key */
+ compare = inlineApplySortFunction(&scanKey->sk_func, scanKey->sk_flags,
+ a->datum1, a->isnull1,
+ b->datum1, b->isnull1);
+ if (compare != 0)
+ return compare;
+
+ /* Compare additional sort keys */
+ ltup = (HeapTuple) a->tuple;
+ rtup = (HeapTuple) b->tuple;
+
+ if (state->indexInfo->ii_Expressions == NULL)
+ {
+ /* if not expression index, just get the proper heap_getattr */
+
+ tupDesc = state->tupDesc;
+ scanKey++;
+
+ for (nkey = 1; nkey < state->nKeys; nkey++, scanKey++)
+ {
+ Datum datum1,
+ datum2;
+ bool isnull1,
+ isnull2;
+
+ datum1 = heap_getattr(ltup, state->indexInfo->ii_KeyAttrNumbers[nkey], tupDesc, &isnull1);
+ datum2 = heap_getattr(rtup, state->indexInfo->ii_KeyAttrNumbers[nkey], tupDesc, &isnull2);
+
+ compare = inlineApplySortFunction(&scanKey->sk_func, scanKey->sk_flags,
+ datum1, isnull1,
+ datum2, isnull2);
+ if (compare != 0)
+ return compare;
+ }
+ }
+ else
+ {
+ /* in the expression index case, we get all the values/nulls:
+ * it would be faster to get only the required ones, but it would mean
+ * copy&paste from FormIndexDatum
+ */
+
+ Datum l_existing_values[INDEX_MAX_KEYS];
+ bool l_existing_isnull[INDEX_MAX_KEYS];
+ Datum r_existing_values[INDEX_MAX_KEYS];
+ bool r_existing_isnull[INDEX_MAX_KEYS];
+ TupleTableSlot *ecxt_scantuple;
+
+ ecxt_scantuple = GetPerTupleExprContext(state->estate)->ecxt_scantuple;
+
+ ExecStoreTuple(ltup, ecxt_scantuple, InvalidBuffer, false);
+ FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
+ l_existing_values, l_existing_isnull);
+
+ ExecStoreTuple(rtup, ecxt_scantuple, InvalidBuffer, false);
+ FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
+ r_existing_values, r_existing_isnull);
+
+ for (nkey = 1; nkey < state->nKeys; nkey++, scanKey++)
+ {
+ compare = inlineApplySortFunction(&scanKey->sk_func,
+ scanKey->sk_flags,
+ l_existing_values[nkey],
+ l_existing_isnull[nkey],
+ r_existing_values[nkey],
+ r_existing_isnull[nkey]);
+
+ if (compare != 0)
+ return compare;
+
+ }
+ }
+
+
+
+ return 0;
+ }
+
+ static void
+ copytup_rawheap(Tuplesortstate *state, SortTuple *stup, void *tup)
+ {
+ HeapTuple tuple = (HeapTuple) tup;
+
+ /* copy the tuple into sort storage */
+ stup->tuple = (void *) heap_copytuple(tuple);
+ USEMEM(state, GetMemoryChunkSpace(stup->tuple));
+ /* set up first-column key value */
+ if (state->indexInfo->ii_Expressions == NULL)
+ {
+ /* no expression index, just get the key datum value */
+ stup->datum1 = heap_getattr((HeapTuple) stup->tuple,
+ state->indexInfo->ii_KeyAttrNumbers[0],
+ state->tupDesc,
+ &stup->isnull1);
+ }
+ else
+ {
+ /*
+ * Extract the index column values and isnull flags from the existing
+ * tuple; we're interested only in the very first one, but to avoid
+ * copy&paste from FormIndexDatum we get all of them (even if it's
+ * slower)
+ */
+
+ Datum existing_values[INDEX_MAX_KEYS];
+ bool existing_isnull[INDEX_MAX_KEYS];
+ TupleTableSlot *ecxt_scantuple;
+
+ ecxt_scantuple = GetPerTupleExprContext(state->estate)->ecxt_scantuple;
+ ExecStoreTuple(tuple, ecxt_scantuple, InvalidBuffer, false);
+ FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
+ existing_values, existing_isnull);
+
+ stup->datum1 = existing_values[0];
+ stup->isnull1 = existing_isnull[0];
+ }
+ }
+
+ static void
+ writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup)
+ {
+ HeapTuple tuple = (HeapTuple) stup->tuple;
+ tuple->t_len += HEAPTUPLESIZE; /* write out the header as well */
+
+ LogicalTapeWrite(state->tapeset, tapenum,
+ tuple, HEAPTUPLESIZE);
+ LogicalTapeWrite(state->tapeset, tapenum, tuple->t_data, tuple->t_len-HEAPTUPLESIZE);
+ if (state->randomAccess) /* need trailing length word? */
+ LogicalTapeWrite(state->tapeset, tapenum,
+ tuple, sizeof(tuple->t_len));
+
+ FREEMEM(state, GetMemoryChunkSpace(tuple));
+ heap_freetuple(tuple);
+ }
+
+ static void
+ readtup_rawheap(Tuplesortstate *state, SortTuple *stup,
+ int tapenum, unsigned int tuplen)
+ {
+ HeapTuple tuple = (HeapTuple) palloc(tuplen);
+
+ USEMEM(state, GetMemoryChunkSpace(tuple));
+
+ tuple->t_len = tuplen - HEAPTUPLESIZE;
+ if (LogicalTapeRead(state->tapeset, tapenum, &tuple->t_self, HEAPTUPLESIZE-sizeof(tuplen)) != HEAPTUPLESIZE-sizeof(tuplen))
+ elog(ERROR, "unexpected end of data");
+ tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
+ if (LogicalTapeRead(state->tapeset, tapenum, tuple->t_data, tuple->t_len) != tuple->t_len)
+ elog(ERROR, "unexpected end of data");
+ if (state->randomAccess) /* need trailing length word? */
+ if (LogicalTapeRead(state->tapeset, tapenum, &tuplen,
+ sizeof(tuplen)) != sizeof(tuplen))
+ elog(ERROR, "unexpected end of data");
+
+ stup->tuple = tuple;
+
+ /* set up first-column key value */
+ if (state->indexInfo->ii_Expressions == NULL)
+ {
+ /* no expression index, just get the key datum value */
+ stup->datum1 = heap_getattr((HeapTuple) stup->tuple,
+ state->indexInfo->ii_KeyAttrNumbers[0],
+ state->tupDesc,
+ &stup->isnull1);
+ }
+ else
+ {
+ /*
+ * Extract the index column values and isnull flags from the existing
+ * tuple; we're interested only in the very first one, but to avoid
+ * copy&paste from FormIndexDatum we get all of them (even if it's
+ * slower)
+ */
+
+ Datum existing_values[INDEX_MAX_KEYS];
+ bool existing_isnull[INDEX_MAX_KEYS];
+ TupleTableSlot *ecxt_scantuple;
+
+ ecxt_scantuple = GetPerTupleExprContext(state->estate)->ecxt_scantuple;
+ ExecStoreTuple(tuple, ecxt_scantuple, InvalidBuffer, false);
+ FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
+ existing_values, existing_isnull);
+
+ stup->datum1 = existing_values[0];
+ stup->isnull1 = existing_isnull[0];
+ }
+ }
+
+ /*
* Convenience routine to free a tuple previously loaded into sort memory
*/
static void
Index: src/include/optimizer/cost.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/optimizer/cost.h,v
retrieving revision 1.100
diff -c -r1.100 cost.h
*** src/include/optimizer/cost.h 2 Jan 2010 16:58:07 -0000 1.100
--- src/include/optimizer/cost.h 10 Feb 2010 17:03:52 -0000
***************
*** 109,114 ****
--- 109,118 ----
extern void cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan);
extern void cost_qual_eval(QualCost *cost, List *quals, PlannerInfo *root);
extern void cost_qual_eval_node(QualCost *cost, Node *qual, PlannerInfo *root);
+ extern void cost_index_scan_vs_seqscansort(Oid tableOid, Oid indexOid,
+ Cost *indexScanTotCost,
+ Cost *seqAndSortCost);
+
extern void set_baserel_size_estimates(PlannerInfo *root, RelOptInfo *rel);
extern void set_joinrel_size_estimates(PlannerInfo *root, RelOptInfo *rel,
RelOptInfo *outer_rel,
Index: src/include/utils/tuplesort.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/utils/tuplesort.h,v
retrieving revision 1.35
diff -c -r1.35 tuplesort.h
*** src/include/utils/tuplesort.h 2 Jan 2010 16:58:10 -0000 1.35
--- src/include/utils/tuplesort.h 10 Feb 2010 17:03:52 -0000
***************
*** 64,69 ****
--- 64,72 ----
extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
Oid sortOperator, bool nullsFirstFlag,
int workMem, bool randomAccess);
+ extern Tuplesortstate *tuplesort_begin_rawheap(Relation indexRel,
+ TupleDesc tupDesc,
+ int workMem, bool randomAccess);
extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
***************
*** 72,77 ****
--- 75,81 ----
extern void tuplesort_putindextuple(Tuplesortstate *state, IndexTuple tuple);
extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
bool isNull);
+ extern void tuplesort_putrawtuple(Tuplesortstate *state, HeapTuple tup);
extern void tuplesort_performsort(Tuplesortstate *state);
***************
*** 81,86 ****
--- 85,92 ----
bool *should_free);
extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward,
Datum *val, bool *isNull);
+ extern HeapTuple tuplesort_getrawtuple(Tuplesortstate *state, bool forward,
+ bool *should_free);
extern void tuplesort_end(Tuplesortstate *state);
Index: src/backend/commands/cluster.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/commands/cluster.c,v
retrieving revision 1.200
diff -c -r1.200 cluster.c
*** src/backend/commands/cluster.c 9 Feb 2010 21:43:30 -0000 1.200
--- src/backend/commands/cluster.c 10 Feb 2010 17:03:49 -0000
***************
*** 49,55 ****
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
!
/*
* This struct is used to pass around the information on tables to be
--- 49,59 ----
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
! #include "utils/tuplesort.h"
! #include "optimizer/plancat.h"
! #include "optimizer/pathnode.h"
! #include "optimizer/cost.h"
! #include "executor/spi_priv.h"
/*
* This struct is used to pass around the information on tables to be
***************
*** 69,76 ****
int freeze_min_age, int freeze_table_age,
bool *pSwapToastByContent, TransactionId *pFreezeXid);
static List *get_tables_to_cluster(MemoryContext cluster_context);
!
!
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
--- 73,81 ----
int freeze_min_age, int freeze_table_age,
bool *pSwapToastByContent, TransactionId *pFreezeXid);
static List *get_tables_to_cluster(MemoryContext cluster_context);
! static void deform_and_rewrite_tuple(HeapTuple tuple, TupleDesc oldTupDesc, TupleDesc newTupDesc,
! Datum *values, bool *isnull,
! bool newRelHasOids, RewriteState rwstate);
/*---------------------------------------------------------------------------
* This cluster code allows for clustering multiple tables at once. Because
***************
*** 769,774 ****
--- 774,781 ----
TransactionId OldestXmin;
TransactionId FreezeXid;
RewriteState rwstate;
+ bool use_sort = false;
+ Tuplesortstate *tuplesort;
/*
* Open the relations we need.
***************
*** 877,883 ****
* tuples that still need to be copied, we scan with SnapshotAny and use
* HeapTupleSatisfiesVacuum for the visibility test.
*/
! if (OldIndex != NULL)
{
heapScan = NULL;
indexScan = index_beginscan(OldHeap, OldIndex,
--- 884,900 ----
* tuples that still need to be copied, we scan with SnapshotAny and use
* HeapTupleSatisfiesVacuum for the visibility test.
*/
! if (OldIndex != NULL && (OldIndex->rd_am->amcanorder) && (OldIndex->rd_rel->relam == BTREE_AM_OID))
! {
! Cost indexScanTotCost,
! seqAndSortCost;
! cost_index_scan_vs_seqscansort(OIDOldHeap, OIDOldIndex,
! &indexScanTotCost,
! &seqAndSortCost);
! use_sort = seqAndSortCost < indexScanTotCost;
! }
!
! if (OldIndex != NULL && !use_sort)
{
heapScan = NULL;
indexScan = index_beginscan(OldHeap, OldIndex,
***************
*** 886,905 ****
else
{
heapScan = heap_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
indexScan = NULL;
}
for (;;)
{
HeapTuple tuple;
- HeapTuple copiedTuple;
Buffer buf;
! bool isdead;
! int i;
CHECK_FOR_INTERRUPTS();
! if (OldIndex != NULL)
{
tuple = index_getnext(indexScan, ForwardScanDirection);
if (tuple == NULL)
--- 903,926 ----
else
{
heapScan = heap_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
+ if (use_sort)
+ {
+ tuplesort = tuplesort_begin_rawheap(OldIndex, oldTupDesc,
+ maintenance_work_mem, false);
+ }
indexScan = NULL;
}
for (;;)
{
+
HeapTuple tuple;
Buffer buf;
! bool isdead;
CHECK_FOR_INTERRUPTS();
! if (OldIndex != NULL && !use_sort)
{
tuple = index_getnext(indexScan, ForwardScanDirection);
if (tuple == NULL)
***************
*** 919,925 ****
buf = heapScan->rs_cbuf;
}
-
LockBuffer(buf, BUFFER_LOCK_SHARE);
switch (HeapTupleSatisfiesVacuum(tuple->t_data, OldestXmin, buf))
--- 940,945 ----
***************
*** 978,1022 ****
continue;
}
! /*
! * We cannot simply copy the tuple as-is, for several reasons:
! *
! * 1. We'd like to squeeze out the values of any dropped columns, both
! * to save space and to ensure we have no corner-case failures. (It's
! * possible for example that the new table hasn't got a TOAST table
! * and so is unable to store any large values of dropped cols.)
! *
! * 2. The tuple might not even be legal for the new table; this is
! * currently only known to happen as an after-effect of ALTER TABLE
! * SET WITHOUT OIDS.
! *
! * So, we must reconstruct the tuple from component Datums.
! */
! heap_deform_tuple(tuple, oldTupDesc, values, isnull);
!
! /* Be sure to null out any dropped columns */
! for (i = 0; i < natts; i++)
{
! if (newTupDesc->attrs[i]->attisdropped)
! isnull[i] = true;
}
- copiedTuple = heap_form_tuple(newTupDesc, values, isnull);
! /* Preserve OID, if any */
! if (NewHeap->rd_rel->relhasoids)
! HeapTupleSetOid(copiedTuple, HeapTupleGetOid(tuple));
! /* The heap rewrite module does the rest */
! rewrite_heap_tuple(rwstate, tuple, copiedTuple);
! heap_freetuple(copiedTuple);
}
! if (OldIndex != NULL)
index_endscan(indexScan);
else
heap_endscan(heapScan);
/* Write out any remaining tuples, and fsync if needed */
end_heap_rewrite(rwstate);
--- 998,1051 ----
continue;
}
! if (!use_sort)
! {
! deform_and_rewrite_tuple(tuple, oldTupDesc, newTupDesc, values, isnull,
! NewHeap->rd_rel->relhasoids, rwstate);
! }
! else
{
! /* pass the tuple to the tuplesort */
! tuplesort_putrawtuple(tuplesort, tuple);
}
! }
!
! if (use_sort)
! {
! tuplesort_performsort(tuplesort);
!
! /* read sorted tuples from the tuplesort */
! for (;;)
! {
! HeapTuple tuple;
! bool shouldfree;
!
! CHECK_FOR_INTERRUPTS();
! tuple = tuplesort_getrawtuple(tuplesort, true, &shouldfree);
! if (tuple == NULL)
! break;
! deform_and_rewrite_tuple(tuple, oldTupDesc, newTupDesc, values, isnull,
! NewHeap->rd_rel->relhasoids, rwstate);
! if (shouldfree)
! heap_freetuple(tuple);
! }
}
!
! if (OldIndex != NULL && !use_sort)
! {
index_endscan(indexScan);
+ }
else
+ {
heap_endscan(heapScan);
+ if (use_sort)
+ tuplesort_end(tuplesort);
+ }
/* Write out any remaining tuples, and fsync if needed */
end_heap_rewrite(rwstate);
***************
*** 1519,1521 ****
--- 1548,1594 ----
return rvs;
}
+
+ static void deform_and_rewrite_tuple(HeapTuple tuple, TupleDesc oldTupDesc, TupleDesc newTupDesc,
+ Datum *values, bool *isnull,
+ bool newRelHasOids, RewriteState rwstate)
+ {
+ HeapTuple copiedTuple;
+ int i;
+
+ /*
+ * We cannot simply copy the tuple as-is, for several reasons:
+ *
+ * 1. We'd like to squeeze out the values of any dropped columns, both
+ * to save space and to ensure we have no corner-case failures. (It's
+ * possible for example that the new table hasn't got a TOAST table
+ * and so is unable to store any large values of dropped cols.)
+ *
+ * 2. The tuple might not even be legal for the new table; this is
+ * currently only known to happen as an after-effect of ALTER TABLE
+ * SET WITHOUT OIDS.
+ *
+ * So, we must reconstruct the tuple from component Datums.
+ */
+
+ heap_deform_tuple(tuple, oldTupDesc, values, isnull);
+
+ /* Be sure to null out any dropped columns */
+ for (i = 0; i < newTupDesc->natts; i++)
+ {
+ if (newTupDesc->attrs[i]->attisdropped)
+ isnull[i] = true;
+ }
+
+ copiedTuple = heap_form_tuple(newTupDesc, values, isnull);
+
+ /* Preserve OID, if any */
+ if (newRelHasOids)
+ HeapTupleSetOid(copiedTuple, HeapTupleGetOid(tuple));
+
+ /* The heap rewrite module does the rest */
+ rewrite_heap_tuple(rwstate, tuple, copiedTuple);
+
+ heap_freetuple(copiedTuple);
+ }
+
Leonardo F <m_lists@yahoo.it> wrote:
Attached are the updated patch (it should solve a bug) and a script.
I reviewed your patch. It seems to be in good shape, and worked as
expected. I suppressed a compiler warning in the patch and cleaned up
whitespace in it. Patch attached.
I think we need some documentation for the change. The only downside
of the feature is that a sorted cluster requires twice the disk space
of the target table (temp space for the disk sort plus the result table).
Could I ask you to write documentation about the new behavior?
Also, the code comments can be improved; in particular we need a better
description than "copy&paste from FormIndexDatum".
Regards,
---
Takahiro Itagaki
NTT Open Source Software Center
Attachments:
sorted_cluster-20100706.patch (application/octet-stream)
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index ccb4599..07fbf3b 100644
*** a/src/backend/commands/cluster.c
--- b/src/backend/commands/cluster.c
***************
*** 36,41 ****
--- 36,42 ----
#include "commands/trigger.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
+ #include "optimizer/cost.h"
#include "storage/bufmgr.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
***************
*** 49,54 ****
--- 50,56 ----
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+ #include "utils/tuplesort.h"
/*
*************** static void copy_heap_data(Oid OIDNewHea
*** 69,74 ****
--- 71,79 ----
int freeze_min_age, int freeze_table_age,
bool *pSwapToastByContent, TransactionId *pFreezeXid);
static List *get_tables_to_cluster(MemoryContext cluster_context);
+ static void deform_and_rewrite_tuple(HeapTuple tuple, TupleDesc oldTupDesc,
+ TupleDesc newTupDesc, Datum *values, bool *isnull,
+ bool newRelHasOids, RewriteState rwstate);
*************** copy_heap_data(Oid OIDNewHeap, Oid OIDOl
*** 757,762 ****
--- 762,769 ----
TransactionId OldestXmin;
TransactionId FreezeXid;
RewriteState rwstate;
+ bool use_sort;
+ Tuplesortstate *tuplesort;
/*
* Open the relations we need.
*************** copy_heap_data(Oid OIDNewHeap, Oid OIDOl
*** 848,876 ****
* tuples that still need to be copied, we scan with SnapshotAny and use
* HeapTupleSatisfiesVacuum for the visibility test.
*/
! if (OldIndex != NULL)
{
heapScan = NULL;
indexScan = index_beginscan(OldHeap, OldIndex,
SnapshotAny, 0, (ScanKey) NULL);
}
else
{
heapScan = heap_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
indexScan = NULL;
}
for (;;)
{
HeapTuple tuple;
- HeapTuple copiedTuple;
Buffer buf;
bool isdead;
- int i;
CHECK_FOR_INTERRUPTS();
! if (OldIndex != NULL)
{
tuple = index_getnext(indexScan, ForwardScanDirection);
if (tuple == NULL)
--- 855,899 ----
* tuples that still need to be copied, we scan with SnapshotAny and use
* HeapTupleSatisfiesVacuum for the visibility test.
*/
! if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
! {
! Cost indexScanCost,
! seqScanAndSortCost;
!
! cost_index_scan_vs_seqscansort(OIDOldHeap, OIDOldIndex,
! &indexScanCost, &seqScanAndSortCost);
! use_sort = seqScanAndSortCost < indexScanCost;
! }
! else
! use_sort = false;
!
! if (OldIndex != NULL && !use_sort)
{
heapScan = NULL;
indexScan = index_beginscan(OldHeap, OldIndex,
SnapshotAny, 0, (ScanKey) NULL);
+ tuplesort = NULL;
}
else
{
heapScan = heap_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
indexScan = NULL;
+ if (use_sort)
+ tuplesort = tuplesort_begin_rawheap(OldIndex, oldTupDesc,
+ maintenance_work_mem, false);
+ else
+ tuplesort = NULL;
}
for (;;)
{
HeapTuple tuple;
Buffer buf;
bool isdead;
CHECK_FOR_INTERRUPTS();
! if (indexScan != NULL)
{
tuple = index_getnext(indexScan, ForwardScanDirection);
if (tuple == NULL)
*************** copy_heap_data(Oid OIDNewHeap, Oid OIDOl
*** 949,993 ****
continue;
}
! /*
! * We cannot simply copy the tuple as-is, for several reasons:
! *
! * 1. We'd like to squeeze out the values of any dropped columns, both
! * to save space and to ensure we have no corner-case failures. (It's
! * possible for example that the new table hasn't got a TOAST table
! * and so is unable to store any large values of dropped cols.)
! *
! * 2. The tuple might not even be legal for the new table; this is
! * currently only known to happen as an after-effect of ALTER TABLE
! * SET WITHOUT OIDS.
! *
! * So, we must reconstruct the tuple from component Datums.
! */
! heap_deform_tuple(tuple, oldTupDesc, values, isnull);
! /* Be sure to null out any dropped columns */
! for (i = 0; i < natts; i++)
! {
! if (newTupDesc->attrs[i]->attisdropped)
! isnull[i] = true;
! }
! copiedTuple = heap_form_tuple(newTupDesc, values, isnull);
! /* Preserve OID, if any */
! if (NewHeap->rd_rel->relhasoids)
! HeapTupleSetOid(copiedTuple, HeapTupleGetOid(tuple));
! /* The heap rewrite module does the rest */
! rewrite_heap_tuple(rwstate, tuple, copiedTuple);
! heap_freetuple(copiedTuple);
}
! if (OldIndex != NULL)
index_endscan(indexScan);
! else
heap_endscan(heapScan);
/* Write out any remaining tuples, and fsync if needed */
end_heap_rewrite(rwstate);
--- 972,1018 ----
continue;
}
! if (tuplesort != NULL)
! tuplesort_putrawtuple(tuplesort, tuple);
! else
! deform_and_rewrite_tuple(tuple, oldTupDesc, newTupDesc,
! values, isnull,
! NewHeap->rd_rel->relhasoids, rwstate);
! }
! /*
! * In scan-and-sort mode, read all tuples from the tuplesort and write
! * them out into the new relation.
! */
! if (tuplesort != NULL)
! {
! tuplesort_performsort(tuplesort);
! for (;;)
! {
! HeapTuple tuple;
! bool shouldfree;
! CHECK_FOR_INTERRUPTS();
! tuple = tuplesort_getrawtuple(tuplesort, true, &shouldfree);
! if (tuple == NULL)
! break;
! deform_and_rewrite_tuple(tuple, oldTupDesc, newTupDesc,
! values, isnull,
! NewHeap->rd_rel->relhasoids, rwstate);
! if (shouldfree)
! heap_freetuple(tuple);
! }
}
! if (indexScan != NULL)
index_endscan(indexScan);
! if (heapScan != NULL)
heap_endscan(heapScan);
+ if (tuplesort != NULL)
+ tuplesort_end(tuplesort);
/* Write out any remaining tuples, and fsync if needed */
end_heap_rewrite(rwstate);
*************** get_tables_to_cluster(MemoryContext clus
*** 1486,1488 ****
--- 1511,1557 ----
return rvs;
}
+
+ static void deform_and_rewrite_tuple(HeapTuple tuple, TupleDesc oldTupDesc, TupleDesc newTupDesc,
+ Datum *values, bool *isnull,
+ bool newRelHasOids, RewriteState rwstate)
+ {
+ HeapTuple copiedTuple;
+ int i;
+
+ /*
+ * We cannot simply copy the tuple as-is, for several reasons:
+ *
+ * 1. We'd like to squeeze out the values of any dropped columns, both
+ * to save space and to ensure we have no corner-case failures. (It's
+ * possible for example that the new table hasn't got a TOAST table
+ * and so is unable to store any large values of dropped cols.)
+ *
+ * 2. The tuple might not even be legal for the new table; this is
+ * currently only known to happen as an after-effect of ALTER TABLE
+ * SET WITHOUT OIDS.
+ *
+ * So, we must reconstruct the tuple from component Datums.
+ */
+
+ heap_deform_tuple(tuple, oldTupDesc, values, isnull);
+
+ /* Be sure to null out any dropped columns */
+ for (i = 0; i < newTupDesc->natts; i++)
+ {
+ if (newTupDesc->attrs[i]->attisdropped)
+ isnull[i] = true;
+ }
+
+ copiedTuple = heap_form_tuple(newTupDesc, values, isnull);
+
+ /* Preserve OID, if any */
+ if (newRelHasOids)
+ HeapTupleSetOid(copiedTuple, HeapTupleGetOid(tuple));
+
+ /* The heap rewrite module does the rest */
+ rewrite_heap_tuple(rwstate, tuple, copiedTuple);
+
+ heap_freetuple(copiedTuple);
+ }
+
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 257c5f0..1592d33 100644
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
***************
*** 76,81 ****
--- 76,82 ----
#include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/placeholder.h"
+ #include "optimizer/plancat.h"
#include "optimizer/planmain.h"
#include "optimizer/restrictinfo.h"
#include "parser/parsetree.h"
*************** cost_qual_eval_walker(Node *node, cost_q
*** 2671,2676 ****
--- 2672,2748 ----
(void *) context);
}
+ /*
+ * cost_index_scan_vs_seqscansort
+ * Estimates and returns the cost of a full-index scan and of a sort after
+ * a sequential scan.
+ */
+ void
+ cost_index_scan_vs_seqscansort(Oid tableOid,
+ Oid indexOid,
+ Cost *indexScanCost,
+ Cost *seqScanAndSortCost)
+ {
+ RelOptInfo *rel;
+ PlannerInfo *root;
+ Query *query;
+ PlannerGlobal *glob;
+ RangeTblEntry *rte;
+ ListCell *index;
+ IndexPath *indexScanPath;
+ Path *seqScanAndSortPath;
+
+ /* make a dummy planner */
+ glob = makeNode(PlannerGlobal);
+
+ query = makeNode(Query);
+ query->resultRelation = 0;
+
+ root = makeNode(PlannerInfo);
+ root->parse = query;
+ root->glob = glob;
+
+ rel = makeNode(RelOptInfo);
+ rel->reloptkind = RELOPT_BASEREL;
+ rel->relid = 1;
+ rel->rtekind = RTE_RELATION;
+ get_relation_info(root, tableOid, false, rel);
+ rel->rows = rel->tuples;
+
+ root->total_table_pages = rel->pages;
+
+ rte = makeNode(RangeTblEntry);
+ rte->rtekind = RTE_RELATION;
+ rte->relid = tableOid;
+
+ root->simple_rel_array_size = 2;
+ root->simple_rte_array = (RangeTblEntry **)
+ palloc0(root->simple_rel_array_size * sizeof(RangeTblEntry *));
+ root->simple_rte_array[1] = rte;
+
+ /* estimate the cost of seq scan + sort */
+ seqScanAndSortPath = create_seqscan_path(NULL, rel);
+ cost_sort(seqScanAndSortPath, root, NULL, seqScanAndSortPath->total_cost,
+ rel->tuples, rel->width, -1);
+
+ /* estimate the cost of index scan */
+ indexScanPath = NULL;
+ foreach(index, rel->indexlist)
+ {
+ IndexOptInfo *indexInfo = (IndexOptInfo*) lfirst(index);
+ if (indexInfo->indexoid == indexOid)
+ {
+ indexScanPath = create_index_path(root, indexInfo, NULL, NULL,
+ ForwardScanDirection, NULL);
+ break;
+ }
+ }
+ Assert(indexScanPath != NULL);
+
+ *indexScanCost = indexScanPath->path.total_cost;
+ *seqScanAndSortCost = seqScanAndSortPath->total_cost;
+ }
+
/*
* adjust_semi_join
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 9bc78d6..3a23a33 100644
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
***************
*** 104,109 ****
--- 104,110 ----
#include "access/nbtree.h"
#include "catalog/pg_amop.h"
#include "catalog/pg_operator.h"
+ #include "catalog/index.h"
#include "commands/tablespace.h"
#include "miscadmin.h"
#include "pg_trace.h"
***************
*** 115,126 ****
--- 116,129 ----
#include "utils/rel.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+ #include "executor/executor.h"
/* sort-type codes for sort__start probes */
#define HEAP_SORT 0
#define INDEX_SORT 1
#define DATUM_SORT 2
+ #define RAWHEAP_SORT 3
/* GUC variables */
#ifdef TRACE_SORT
*************** struct Tuplesortstate
*** 366,371 ****
--- 369,378 ----
int datumTypeLen;
bool datumTypeByVal;
+ /* These are specific to the rawheap subcase: */
+ EState *estate;
+ IndexInfo *indexInfo;
+
/*
* Resource snapshot for time of sort start.
*/
*************** static void writetup_heap(Tuplesortstate
*** 450,455 ****
--- 457,468 ----
static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void reversedirection_heap(Tuplesortstate *state);
+ static int comparetup_rawheap(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state);
+ static void copytup_rawheap(Tuplesortstate *state, SortTuple *stup, void *tup);
+ static void writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup);
+ static void readtup_rawheap(Tuplesortstate *state, SortTuple *stup,
+ int tapenum, unsigned int len);
static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
*************** tuplesort_begin_common(int workMem, bool
*** 549,554 ****
--- 562,570 ----
state->result_tape = -1; /* flag that result tape has not been formed */
+ /* set estate to NULL, so we don't try to free it later if not used */
+ state->estate = NULL;
+
MemoryContextSwitchTo(oldcontext);
return state;
*************** tuplesort_begin_datum(Oid datumType,
*** 762,767 ****
--- 778,843 ----
return state;
}
+ Tuplesortstate *
+ tuplesort_begin_rawheap(Relation indexRel, TupleDesc tupDesc,
+ int workMem, bool randomAccess)
+ {
+ Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+ MemoryContext oldcontext;
+
+ TupleTableSlot *existing_slot;
+ ExprContext *econtext;
+
+ Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
+
+ oldcontext = MemoryContextSwitchTo(state->sortcontext);
+
+ #ifdef TRACE_SORT
+ if (trace_sort)
+ elog(LOG,
+ "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c",
+ RelationGetNumberOfAttributes(indexRel), workMem, randomAccess ? 't' : 'f');
+ #endif
+
+ state->nKeys = RelationGetNumberOfAttributes(indexRel);
+
+ TRACE_POSTGRESQL_SORT_START(RAWHEAP_SORT, false, state->nKeys, workMem, randomAccess);
+
+ state->comparetup = comparetup_rawheap;
+ state->copytup = copytup_rawheap;
+ state->writetup = writetup_rawheap;
+ state->readtup = readtup_rawheap;
+ state->reversedirection = reversedirection_heap;
+
+ state->indexInfo = BuildIndexInfo(indexRel);
+ state->indexScanKey = _bt_mkscankey_nodata(indexRel);
+ state->enforceUnique = false;
+
+ state->tupDesc = tupDesc; /* assume we need not copy tupDesc */
+
+ if (state->indexInfo->ii_Expressions != NULL)
+ {
+ /* allocate the vars used by FormIndexDatum */
+ state->estate = CreateExecutorState();
+
+ /*
+ * Need a TupleTableSlot to put existing tuples in.
+ *
+ * To use FormIndexDatum, we have to make the econtext's scantuple point
+ * to this slot. Be sure to save and restore caller's value for
+ * scantuple.
+ */
+ existing_slot = MakeSingleTupleTableSlot(tupDesc);
+
+ econtext = GetPerTupleExprContext(state->estate);
+ econtext->ecxt_scantuple = existing_slot;
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return state;
+ }
+
/*
* tuplesort_set_bound
*
*************** tuplesort_end(Tuplesortstate *state)
*** 849,854 ****
--- 925,933 ----
*/
TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, 0L);
#endif
+ if (state->estate != NULL)
+ ExecDropSingleTupleTableSlot(GetPerTupleExprContext(state->estate)->ecxt_scantuple);
+
MemoryContextSwitchTo(oldcontext);
*************** tuplesort_putdatum(Tuplesortstate *state
*** 980,985 ****
--- 1059,1086 ----
}
/*
+ * Accept one tuple while collecting input data for sort.
+ *
+ * Note that the input data is always copied; the caller need not save it.
+ */
+ void
+ tuplesort_putrawtuple(Tuplesortstate *state, HeapTuple tup)
+ {
+ MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ SortTuple stup;
+
+ /*
+ * Copy the given tuple into memory we control, and decrease availMem.
+ * Then call the common code.
+ */
+ COPYTUP(state, &stup, (void *) tup);
+
+ puttuple_common(state, &stup);
+
+ MemoryContextSwitchTo(oldcontext);
+ }
+
+ /*
* Shared code for tuple and datum cases.
*/
static void
*************** tuplesort_getdatum(Tuplesortstate *state
*** 1482,1487 ****
--- 1583,1607 ----
}
/*
+ * Fetch the next tuple in either forward or back direction. Returns NULL if
+ * no more tuples. If *should_free is set, the caller must pfree the returned
+ * tuple when done with it.
+ */
+ HeapTuple
+ tuplesort_getrawtuple(Tuplesortstate *state, bool forward, bool *should_free)
+ {
+ MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ SortTuple stup;
+
+ if (!tuplesort_gettuple_common(state, forward, &stup, should_free))
+ stup.tuple = NULL;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return stup.tuple;
+ }
+
+ /*
* tuplesort_merge_order - report merge order we'll use for given memory
* (note: "merge order" just means the number of input tapes in the merge).
*
*************** reversedirection_datum(Tuplesortstate *s
*** 3079,3084 ****
--- 3199,3409 ----
}
/*
+ * Routines specialized for the raw on-disk HeapTuple case. These are only
+ * used when we need the visibility info for things like CLUSTER. Otherwise
+ * we use the regular *tup_heap methods, which actually manipulate
+ * MinimalTuples.
+ */
+ static int
+ comparetup_rawheap(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
+ {
+ ScanKey scanKey = state->indexScanKey;
+ HeapTuple ltup;
+ HeapTuple rtup;
+ TupleDesc tupDesc;
+ int nkey;
+ int32 compare;
+
+ /* Allow interrupting long sorts */
+ CHECK_FOR_INTERRUPTS();
+
+ /* Compare the leading sort key */
+ compare = inlineApplySortFunction(&scanKey->sk_func,
+ scanKey->sk_flags,
+ a->datum1, a->isnull1,
+ b->datum1, b->isnull1);
+ if (compare != 0)
+ return compare;
+
+ /* Compare additional sort keys */
+ ltup = (HeapTuple) a->tuple;
+ rtup = (HeapTuple) b->tuple;
+
+ if (state->indexInfo->ii_Expressions == NULL)
+ {
+ /* if not expression index, just get the proper heap_getattr */
+
+ tupDesc = state->tupDesc;
+ scanKey++;
+
+ for (nkey = 1; nkey < state->nKeys; nkey++, scanKey++)
+ {
+ Datum datum1,
+ datum2;
+ bool isnull1,
+ isnull2;
+
+ datum1 = heap_getattr(ltup, state->indexInfo->ii_KeyAttrNumbers[nkey], tupDesc, &isnull1);
+ datum2 = heap_getattr(rtup, state->indexInfo->ii_KeyAttrNumbers[nkey], tupDesc, &isnull2);
+
+ compare = inlineApplySortFunction(&scanKey->sk_func,
+ scanKey->sk_flags,
+ datum1, isnull1,
+ datum2, isnull2);
+ if (compare != 0)
+ return compare;
+ }
+ }
+ else
+ {
+ /*
+ * in the expression index case, we get all the values/nulls:
+ * it would be faster to get only the required ones, but it would mean
+ * copy&paste from FormIndexDatum
+ */
+
+ Datum l_existing_values[INDEX_MAX_KEYS];
+ bool l_existing_isnull[INDEX_MAX_KEYS];
+ Datum r_existing_values[INDEX_MAX_KEYS];
+ bool r_existing_isnull[INDEX_MAX_KEYS];
+ TupleTableSlot *ecxt_scantuple;
+
+ ecxt_scantuple = GetPerTupleExprContext(state->estate)->ecxt_scantuple;
+
+ ExecStoreTuple(ltup, ecxt_scantuple, InvalidBuffer, false);
+ FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
+ l_existing_values, l_existing_isnull);
+
+ ExecStoreTuple(rtup, ecxt_scantuple, InvalidBuffer, false);
+ FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
+ r_existing_values, r_existing_isnull);
+
+ for (nkey = 1; nkey < state->nKeys; nkey++, scanKey++)
+ {
+ compare = inlineApplySortFunction(&scanKey->sk_func,
+ scanKey->sk_flags,
+ l_existing_values[nkey],
+ l_existing_isnull[nkey],
+ r_existing_values[nkey],
+ r_existing_isnull[nkey]);
+ if (compare != 0)
+ return compare;
+ }
+ }
+
+ return 0;
+ }
+
+ static void
+ copytup_rawheap(Tuplesortstate *state, SortTuple *stup, void *tup)
+ {
+ HeapTuple tuple = (HeapTuple) tup;
+
+ /* copy the tuple into sort storage */
+ stup->tuple = (void *) heap_copytuple(tuple);
+ USEMEM(state, GetMemoryChunkSpace(stup->tuple));
+ /* set up first-column key value */
+ if (state->indexInfo->ii_Expressions == NULL)
+ {
+ /* no expression index, just get the key datum value */
+ stup->datum1 = heap_getattr((HeapTuple) stup->tuple,
+ state->indexInfo->ii_KeyAttrNumbers[0],
+ state->tupDesc,
+ &stup->isnull1);
+ }
+ else
+ {
+ /*
+ * Extract the index column values and isnull flags from the existing
+ * tuple; we're interested only in the very first one, but to avoid
+ * copy&paste from FormIndexDatum we get all of them (even if it's
+ * slower)
+ */
+
+ Datum existing_values[INDEX_MAX_KEYS];
+ bool existing_isnull[INDEX_MAX_KEYS];
+ TupleTableSlot *ecxt_scantuple;
+
+ ecxt_scantuple = GetPerTupleExprContext(state->estate)->ecxt_scantuple;
+ ExecStoreTuple(tuple, ecxt_scantuple, InvalidBuffer, false);
+ FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
+ existing_values, existing_isnull);
+
+ stup->datum1 = existing_values[0];
+ stup->isnull1 = existing_isnull[0];
+ }
+ }
+
+ static void
+ writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup)
+ {
+ HeapTuple tuple = (HeapTuple) stup->tuple;
+ tuple->t_len += HEAPTUPLESIZE; /* write out the header as well */
+
+ LogicalTapeWrite(state->tapeset, tapenum,
+ tuple, HEAPTUPLESIZE);
+ LogicalTapeWrite(state->tapeset, tapenum, tuple->t_data, tuple->t_len-HEAPTUPLESIZE);
+ if (state->randomAccess) /* need trailing length word? */
+ LogicalTapeWrite(state->tapeset, tapenum,
+ tuple, sizeof(tuple->t_len));
+
+ FREEMEM(state, GetMemoryChunkSpace(tuple));
+ heap_freetuple(tuple);
+ }
+
+ static void
+ readtup_rawheap(Tuplesortstate *state, SortTuple *stup,
+ int tapenum, unsigned int tuplen)
+ {
+ HeapTuple tuple = (HeapTuple) palloc(tuplen);
+
+ USEMEM(state, GetMemoryChunkSpace(tuple));
+
+ tuple->t_len = tuplen - HEAPTUPLESIZE;
+ if (LogicalTapeRead(state->tapeset, tapenum, &tuple->t_self, HEAPTUPLESIZE-sizeof(tuplen)) != HEAPTUPLESIZE-sizeof(tuplen))
+ elog(ERROR, "unexpected end of data");
+ tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
+ if (LogicalTapeRead(state->tapeset, tapenum, tuple->t_data, tuple->t_len) != tuple->t_len)
+ elog(ERROR, "unexpected end of data");
+ if (state->randomAccess) /* need trailing length word? */
+ if (LogicalTapeRead(state->tapeset, tapenum, &tuplen,
+ sizeof(tuplen)) != sizeof(tuplen))
+ elog(ERROR, "unexpected end of data");
+
+ stup->tuple = tuple;
+
+ /* set up first-column key value */
+ if (state->indexInfo->ii_Expressions == NULL)
+ {
+ /* no expression index, just get the key datum value */
+ stup->datum1 = heap_getattr((HeapTuple) stup->tuple,
+ state->indexInfo->ii_KeyAttrNumbers[0],
+ state->tupDesc,
+ &stup->isnull1);
+ }
+ else
+ {
+ /*
+ * Extract the index column values and isnull flags from the existing
+ * tuple; we're interested only in the very first one, but to avoid
+ * copy&paste from FormIndexDatum we get all of them (even if it's
+ * slower)
+ */
+
+ Datum existing_values[INDEX_MAX_KEYS];
+ bool existing_isnull[INDEX_MAX_KEYS];
+ TupleTableSlot *ecxt_scantuple;
+
+ ecxt_scantuple = GetPerTupleExprContext(state->estate)->ecxt_scantuple;
+ ExecStoreTuple(tuple, ecxt_scantuple, InvalidBuffer, false);
+ FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
+ existing_values, existing_isnull);
+
+ stup->datum1 = existing_values[0];
+ stup->isnull1 = existing_isnull[0];
+ }
+ }
+
+ /*
* Convenience routine to free a tuple previously loaded into sort memory
*/
static void
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 376bb1b..e9a1cd3 100644
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern void cost_hashjoin(HashPath *path
*** 110,115 ****
--- 110,118 ----
extern void cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan);
extern void cost_qual_eval(QualCost *cost, List *quals, PlannerInfo *root);
extern void cost_qual_eval_node(QualCost *cost, Node *qual, PlannerInfo *root);
+ extern void cost_index_scan_vs_seqscansort(Oid tableOid, Oid indexOid,
+ Cost *indexScanCost, Cost *seqScanAndSortCost);
+
extern void set_baserel_size_estimates(PlannerInfo *root, RelOptInfo *rel);
extern void set_joinrel_size_estimates(PlannerInfo *root, RelOptInfo *rel,
RelOptInfo *outer_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 87e26e7..d7f5118 100644
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_i
*** 64,69 ****
--- 64,72 ----
extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
Oid sortOperator, bool nullsFirstFlag,
int workMem, bool randomAccess);
+ extern Tuplesortstate *tuplesort_begin_rawheap(Relation indexRel,
+ TupleDesc tupDesc,
+ int workMem, bool randomAccess);
extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
*************** extern void tuplesort_puttupleslot(Tuple
*** 72,77 ****
--- 75,81 ----
extern void tuplesort_putindextuple(Tuplesortstate *state, IndexTuple tuple);
extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
bool isNull);
+ extern void tuplesort_putrawtuple(Tuplesortstate *state, HeapTuple tup);
extern void tuplesort_performsort(Tuplesortstate *state);
*************** extern IndexTuple tuplesort_getindextupl
*** 81,86 ****
--- 85,92 ----
bool *should_free);
extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward,
Datum *val, bool *isNull);
+ extern HeapTuple tuplesort_getrawtuple(Tuplesortstate *state, bool forward,
+ bool *should_free);
extern void tuplesort_end(Tuplesortstate *state);
I reviewed your patch. It seems to be in good shape, and worked as
expected. I suppressed a compiler warning in the patch and cleaned up
whitespace in it. Patch attached.
Thanks for the review!
I saw that you also changed the writing:
LogicalTapeWrite(state->tapeset, tapenum,
tuple, HEAPTUPLESIZE);
LogicalTapeWrite(state->tapeset, tapenum, tuple->t_data, tuple->t_len-HEAPTUPLESIZE);
and the reading:
tuple->t_len = tuplen - HEAPTUPLESIZE;
if (LogicalTapeRead(state->tapeset, tapenum, &tuple->t_self, tuplen-sizeof(tuplen)) != tuplen-sizeof(tuplen))
elog(ERROR, "unexpected end of data");
into (writing):
LogicalTapeWrite(state->tapeset, tapenum,
tuple, tuple->t_len);
(reading):
if (LogicalTapeRead(state->tapeset, tapenum, &tuple->t_self, tuplen-sizeof(tuplen)) != tuplen-sizeof(tuplen))
Are we sure it's 100% equivalent?
I remember I had issues with the fact that tuple->t_data wasn't
at HEAPTUPLESIZE distance from tuple:
http://osdir.com/ml/pgsql-hackers/2010-02/msg00744.html
I think we need some documentation for the change. The only downside
of the feature is that a sorted cluster requires twice the disk space
of the target table (temp space for the disk sort plus the result table).
Could I ask you to write documentation about the new behavior?
Also, the code comments can be improved; in particular we need a better
description than "copy&paste from FormIndexDatum".
I'll try to improve the comments and add doc changes (but my English
will have to be double-checked...)
Leonardo F <m_lists@yahoo.it> wrote:
I saw that you also changed the writing:
(snip)
Are we sure it's 100% equivalent?
I think writetup_rawheap() and readtup_rawheap() are a little complex,
but they should work as long as there is no padding between t_len and t_self
in the HeapTupleData struct.
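One way to make that assumption explicit would be a cheap assertion (a sketch; offsetof is standard C, and t_len is declared as uint32 in HeapTupleData):

/* the tape format assumes t_self immediately follows t_len */
Assert(offsetof(HeapTupleData, t_self) == sizeof(uint32));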
- It might be cleaner if you write the total item length
and tuple data separately.
- "(char *) tuple + sizeof(tuplen)" might be more robust
than "&tuple->t_self".
Here is some sample code; writetup() and readtup() will be alike.
BTW, we could have LogicalTapeReadExact() as a wrapper for
LogicalTapeRead() that checks the result, because we have a lot of
duplicated code for "unexpected end of data" errors.
static void
writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup)
{
HeapTuple tuple = (HeapTuple) stup->tuple;
int tuplen = tuple->t_len + HEAPTUPLESIZE;
LogicalTapeWrite(state->tapeset, tapenum,
&tuplen, sizeof(tuplen));
LogicalTapeWrite(state->tapeset, tapenum,
(char *) tuple + sizeof(tuplen),
HEAPTUPLESIZE - sizeof(tuplen));
LogicalTapeWrite(state->tapeset, tapenum, tuple->t_data, tuple->t_len);
if (state->randomAccess) /* need trailing length word? */
LogicalTapeWrite(state->tapeset, tapenum, &tuplen, sizeof(tuplen));
FREEMEM(state, GetMemoryChunkSpace(tuple));
heap_freetuple(tuple);
}
static void
readtup_rawheap(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int tuplen)
{
HeapTuple tuple = (HeapTuple) palloc(tuplen);
USEMEM(state, GetMemoryChunkSpace(tuple));
tuple->t_len = tuplen - HEAPTUPLESIZE;
if (LogicalTapeRead(state->tapeset, tapenum,
(char *) tuple + sizeof(tuplen),
HEAPTUPLESIZE - sizeof(tuplen)) != HEAPTUPLESIZE - sizeof(tuplen))
elog(ERROR, "unexpected end of data");
tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
if (LogicalTapeRead(state->tapeset, tapenum,
tuple->t_data, tuple->t_len) != tuple->t_len)
elog(ERROR, "unexpected end of data");
if (state->randomAccess) /* need trailing length word? */
if (LogicalTapeRead(state->tapeset, tapenum, &tuplen,
sizeof(tuplen)) != sizeof(tuplen))
elog(ERROR, "unexpected end of data");
Regards,
---
Takahiro Itagaki
NTT Open Source Software Center
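A LogicalTapeReadExact() along those lines might look like this (a sketch only: the helper does not exist in the patch, and only LogicalTapeRead()'s signature from logtape.h is assumed):

static void
LogicalTapeReadExact(LogicalTapeSet *lts, int tapenum,
                     void *ptr, size_t len)
{
    if (LogicalTapeRead(lts, tapenum, ptr, len) != len)
        elog(ERROR, "unexpected end of data");
}

With such a helper, each of the three reads in readtup_rawheap() collapses to a single call.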
Excerpts from Takahiro Itagaki's message of Wed Jul 07 04:39:38 -0400 2010:
BTW, we could have LogicalTapeReadExact() as a wrapper for
LogicalTapeRead() that checks the result, because we have a lot of
duplicated code for "unexpected end of data" errors.
I'd just add a boolean "exact required" to the existing functions.
I think writetup_rawheap() and readtup_rawheap() are a little complex,
but they should work as long as there is no padding between t_len and t_self
in the HeapTupleData struct.
- It might be cleaner if you write the total item length
and tuple data separately.
- "(char *) tuple + sizeof(tuplen)" might be more robust
than "&tuple->t_self".
- I used your functions
- changed the docs for CLUSTER (I don't know if they make sense/are enough)
- added a minor comment
Two questions:
1) About the "copy&paste from FormIndexDatum" comment: how can I improve it?
The idea is that we could have a faster call, but it would mean copying and
pasting a lot of code from FormIndexDatum.
2) Which other areas should I comment more?
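On question 1, one possible rewording of that comment (just a sketch of the kind of description reviewers tend to ask for):

/*
 * We need only the first index column here, but FormIndexDatum
 * computes all of them.  Extracting a single column would mean
 * duplicating FormIndexDatum's expression-evaluation logic, so we
 * accept the extra cost of forming every column instead.
 */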
Attachments:
sorted_cluster-20100721.patch (application/octet-stream)
Index: doc/src/sgml/ref/cluster.sgml
===================================================================
RCS file: /projects/cvsroot/pgsql/doc/src/sgml/ref/cluster.sgml,v
retrieving revision 1.50
diff -c -r1.50 cluster.sgml
*** doc/src/sgml/ref/cluster.sgml 11 May 2010 16:07:42 -0000 1.50
--- doc/src/sgml/ref/cluster.sgml 21 Jul 2010 14:04:11 -0000
***************
*** 128,138 ****
</para>
<para>
! During the cluster operation, a temporary copy of the table is created
! that contains the table data in the index order. Temporary copies of
! each index on the table are created as well. Therefore, you need free
! space on disk at least equal to the sum of the table size and the index
! sizes.
</para>
<para>
--- 128,159 ----
</para>
<para>
! Based on:
! <itemizedlist spacing="compact">
! <listitem><para>statistics on the table</para></listitem>
! <listitem><para>resource consumption config (see <xref linkend="runtime-config-resource">)</para></listitem>
! <listitem><para>planner method configuration (see <xref linkend="runtime-config-query-enable">)</para></listitem>
! </itemizedlist>
! the planner can choose between two different methods to cluster the table:
! <itemizedlist spacing="compact">
! <listitem><para>index scan</para></listitem>
! <listitem><para>table scan followed by sort</para></listitem>
! </itemizedlist>
! In the first case (index scan), during the cluster operation, a temporary
! copy of the table is created that contains the table data in the index
! order.
! Temporary copies of each index on the table are created as well. Therefore,
! you need free space on disk at least equal to the sum of the table size and
! the index sizes.
! In the second case a full table scan is followed by a sort operation.
! In addition to the free space needed by the previous case, this approach
! may also need a temporary disk sort file which can be as big as the original
! table.
! It is advisable to run <xref linkend="sql-analyze"> on the table before
! clustering it, and to set the resource consumption parameters to sensible
! values (especially <xref linkend="guc-work-mem">).
! Otherwise, the planner might make poor choices of the clustering method to
! use.
</para>
<para>
***************
*** 149,184 ****
Otherwise, the planner might make poor choices of query plans.
</para>
- <para>
- There is another way to cluster data. The
- <command>CLUSTER</command> command reorders the original table by
- scanning it using the index you specify. This can be slow
- on large tables because the rows are fetched from the table
- in index order, and if the table is disordered, the
- entries are on random pages, so there is one disk page
- retrieved for every row moved. (<productname>PostgreSQL</productname> has
- a cache, but the majority of a big table will not fit in the cache.)
- The other way to cluster a table is to use:
-
- <programlisting>
- CREATE TABLE <replaceable class="parameter">newtable</replaceable> AS
- SELECT * FROM <replaceable class="parameter">table</replaceable> ORDER BY <replaceable class="parameter">columnlist</replaceable>;
- </programlisting>
-
- which uses the <productname>PostgreSQL</productname> sorting code
- to produce the desired order;
- this is usually much faster than an index scan for disordered data.
- Then you drop the old table, use
- <command>ALTER TABLE ... RENAME</command>
- to rename <replaceable class="parameter">newtable</replaceable> to the
- old name, and recreate the table's indexes.
- The big disadvantage of this approach is that it does not preserve
- OIDs, constraints, foreign key relationships, granted privileges, and
- other ancillary properties of the table — all such items must be
- manually recreated. Another disadvantage is that this way requires a sort
- temporary file about the same size as the table itself, so peak disk usage
- is about three times the table size instead of twice the table size.
- </para>
</refsect1>
<refsect1>
--- 170,175 ----
Index: src/include/optimizer/cost.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/optimizer/cost.h,v
retrieving revision 1.101
diff -c -r1.101 cost.h
*** src/include/optimizer/cost.h 19 Apr 2010 00:55:26 -0000 1.101
--- src/include/optimizer/cost.h 21 Jul 2010 14:04:17 -0000
***************
*** 110,115 ****
--- 110,117 ----
extern void cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan);
extern void cost_qual_eval(QualCost *cost, List *quals, PlannerInfo *root);
extern void cost_qual_eval_node(QualCost *cost, Node *qual, PlannerInfo *root);
+ extern void cost_index_scan_vs_seqscansort(Oid tableOid, Oid indexOid,
+ Cost *indexScanCost, Cost *seqScanAndSortCost);
extern void set_baserel_size_estimates(PlannerInfo *root, RelOptInfo *rel);
extern void set_joinrel_size_estimates(PlannerInfo *root, RelOptInfo *rel,
RelOptInfo *outer_rel,
Index: src/include/utils/tuplesort.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/utils/tuplesort.h,v
retrieving revision 1.36
diff -c -r1.36 tuplesort.h
*** src/include/utils/tuplesort.h 26 Feb 2010 02:01:29 -0000 1.36
--- src/include/utils/tuplesort.h 21 Jul 2010 14:04:17 -0000
***************
*** 64,69 ****
--- 64,72 ----
extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
Oid sortOperator, bool nullsFirstFlag,
int workMem, bool randomAccess);
+ extern Tuplesortstate *tuplesort_begin_rawheap(Relation indexRel,
+ TupleDesc tupDesc,
+ int workMem, bool randomAccess);
extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
***************
*** 72,77 ****
--- 75,81 ----
extern void tuplesort_putindextuple(Tuplesortstate *state, IndexTuple tuple);
extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
bool isNull);
+ extern void tuplesort_putrawtuple(Tuplesortstate *state, HeapTuple tup);
extern void tuplesort_performsort(Tuplesortstate *state);
***************
*** 81,86 ****
--- 85,92 ----
bool *should_free);
extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward,
Datum *val, bool *isNull);
+ extern HeapTuple tuplesort_getrawtuple(Tuplesortstate *state, bool forward,
+ bool *should_free);
extern void tuplesort_end(Tuplesortstate *state);
Index: src/backend/optimizer/path/costsize.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/optimizer/path/costsize.c,v
retrieving revision 1.218
diff -c -r1.218 costsize.c
*** src/backend/optimizer/path/costsize.c 6 Jul 2010 19:18:56 -0000 1.218
--- src/backend/optimizer/path/costsize.c 21 Jul 2010 14:04:14 -0000
***************
*** 76,81 ****
--- 76,82 ----
#include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/placeholder.h"
+ #include "optimizer/plancat.h"
#include "optimizer/planmain.h"
#include "optimizer/restrictinfo.h"
#include "parser/parsetree.h"
***************
*** 2672,2677 ****
--- 2673,2749 ----
(void *) context);
}
+ /*
+ * cost_index_scan_vs_seqscansort
+ * Estimates and returns the cost of a full-index scan and a sort after
+ * a sequential scan.
+ */
+ void
+ cost_index_scan_vs_seqscansort(Oid tableOid,
+ Oid indexOid,
+ Cost *indexScanCost,
+ Cost *seqScanAndSortCost)
+ {
+ RelOptInfo *rel;
+ PlannerInfo *root;
+ Query *query;
+ PlannerGlobal *glob;
+ RangeTblEntry *rte;
+ ListCell *index;
+ IndexPath *indexScanPath;
+ Path *seqScanAndSortPath;
+
+ /* make a dummy planner */
+ glob = makeNode(PlannerGlobal);
+
+ query = makeNode(Query);
+ query->resultRelation = 0;
+
+ root = makeNode(PlannerInfo);
+ root->parse = query;
+ root->glob = glob;
+
+ rel = makeNode(RelOptInfo);
+ rel->reloptkind = RELOPT_BASEREL;
+ rel->relid = 1;
+ rel->rtekind = RTE_RELATION;
+ get_relation_info(root, tableOid, false, rel);
+ rel->rows = rel->tuples;
+
+ root->total_table_pages = rel->pages;
+
+ rte = makeNode(RangeTblEntry);
+ rte->rtekind = RTE_RELATION;
+ rte->relid = tableOid;
+
+ root->simple_rel_array_size = 2;
+ root->simple_rte_array = (RangeTblEntry **)
+ palloc0(root->simple_rel_array_size * sizeof(RangeTblEntry *));
+ root->simple_rte_array[1] = rte;
+
+ /* estimate the cost of seq scan + sort */
+ seqScanAndSortPath = create_seqscan_path(NULL, rel);
+ cost_sort(seqScanAndSortPath, root, NULL, seqScanAndSortPath->total_cost,
+ rel->tuples, rel->width, -1);
+
+ /* estimate the cost of index scan */
+ indexScanPath = NULL;
+ foreach(index, rel->indexlist)
+ {
+ IndexOptInfo *indexInfo = (IndexOptInfo*) lfirst(index);
+ if (indexInfo->indexoid == indexOid)
+ {
+ indexScanPath = create_index_path(root, indexInfo, NULL, NULL,
+ ForwardScanDirection, NULL);
+ break;
+ }
+ }
+ Assert(indexScanPath != NULL);
+
+ *indexScanCost = indexScanPath->path.total_cost;
+ *seqScanAndSortCost = seqScanAndSortPath->total_cost;
+ }
+
/*
* adjust_semi_join
Index: src/backend/commands/cluster.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/commands/cluster.c,v
retrieving revision 1.203
diff -c -r1.203 cluster.c
*** src/backend/commands/cluster.c 28 Apr 2010 16:10:41 -0000 1.203
--- src/backend/commands/cluster.c 21 Jul 2010 14:04:13 -0000
***************
*** 36,41 ****
--- 36,42 ----
#include "commands/trigger.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
+ #include "optimizer/cost.h"
#include "storage/bufmgr.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
***************
*** 49,54 ****
--- 50,56 ----
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+ #include "utils/tuplesort.h"
/*
***************
*** 69,74 ****
--- 71,79 ----
int freeze_min_age, int freeze_table_age,
bool *pSwapToastByContent, TransactionId *pFreezeXid);
static List *get_tables_to_cluster(MemoryContext cluster_context);
+ static void deform_and_rewrite_tuple(HeapTuple tuple, TupleDesc oldTupDesc,
+ TupleDesc newTupDesc, Datum *values, bool *isnull,
+ bool newRelHasOids, RewriteState rwstate);
***************
*** 757,762 ****
--- 762,769 ----
TransactionId OldestXmin;
TransactionId FreezeXid;
RewriteState rwstate;
+ bool use_sort;
+ Tuplesortstate *tuplesort;
/*
* Open the relations we need.
***************
*** 848,876 ****
* tuples that still need to be copied, we scan with SnapshotAny and use
* HeapTupleSatisfiesVacuum for the visibility test.
*/
! if (OldIndex != NULL)
{
heapScan = NULL;
indexScan = index_beginscan(OldHeap, OldIndex,
SnapshotAny, 0, (ScanKey) NULL);
}
else
{
heapScan = heap_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
indexScan = NULL;
}
for (;;)
{
HeapTuple tuple;
- HeapTuple copiedTuple;
Buffer buf;
bool isdead;
- int i;
CHECK_FOR_INTERRUPTS();
! if (OldIndex != NULL)
{
tuple = index_getnext(indexScan, ForwardScanDirection);
if (tuple == NULL)
--- 855,903 ----
* tuples that still need to be copied, we scan with SnapshotAny and use
* HeapTupleSatisfiesVacuum for the visibility test.
*/
! if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
! {
! /*
! * Check to see if a scan+sort would be less expensive than an index
! * scan
! */
! Cost indexScanCost,
! seqScanAndSortCost;
!
! cost_index_scan_vs_seqscansort(OIDOldHeap, OIDOldIndex,
! &indexScanCost, &seqScanAndSortCost);
! use_sort = seqScanAndSortCost < indexScanCost;
! }
! else
! use_sort = false;
!
! if (OldIndex != NULL && !use_sort)
{
heapScan = NULL;
indexScan = index_beginscan(OldHeap, OldIndex,
SnapshotAny, 0, (ScanKey) NULL);
+ tuplesort = NULL;
}
else
{
heapScan = heap_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
indexScan = NULL;
+ if (use_sort)
+ tuplesort = tuplesort_begin_rawheap(OldIndex, oldTupDesc,
+ maintenance_work_mem, false);
+ else
+ tuplesort = NULL;
}
for (;;)
{
HeapTuple tuple;
Buffer buf;
bool isdead;
CHECK_FOR_INTERRUPTS();
! if (indexScan != NULL)
{
tuple = index_getnext(indexScan, ForwardScanDirection);
if (tuple == NULL)
***************
*** 949,993 ****
continue;
}
! /*
! * We cannot simply copy the tuple as-is, for several reasons:
! *
! * 1. We'd like to squeeze out the values of any dropped columns, both
! * to save space and to ensure we have no corner-case failures. (It's
! * possible for example that the new table hasn't got a TOAST table
! * and so is unable to store any large values of dropped cols.)
! *
! * 2. The tuple might not even be legal for the new table; this is
! * currently only known to happen as an after-effect of ALTER TABLE
! * SET WITHOUT OIDS.
! *
! * So, we must reconstruct the tuple from component Datums.
! */
! heap_deform_tuple(tuple, oldTupDesc, values, isnull);
! /* Be sure to null out any dropped columns */
! for (i = 0; i < natts; i++)
! {
! if (newTupDesc->attrs[i]->attisdropped)
! isnull[i] = true;
! }
! copiedTuple = heap_form_tuple(newTupDesc, values, isnull);
! /* Preserve OID, if any */
! if (NewHeap->rd_rel->relhasoids)
! HeapTupleSetOid(copiedTuple, HeapTupleGetOid(tuple));
! /* The heap rewrite module does the rest */
! rewrite_heap_tuple(rwstate, tuple, copiedTuple);
! heap_freetuple(copiedTuple);
}
! if (OldIndex != NULL)
index_endscan(indexScan);
! else
heap_endscan(heapScan);
/* Write out any remaining tuples, and fsync if needed */
end_heap_rewrite(rwstate);
--- 976,1022 ----
continue;
}
! if (tuplesort != NULL)
! tuplesort_putrawtuple(tuplesort, tuple);
! else
! deform_and_rewrite_tuple(tuple, oldTupDesc, newTupDesc,
! values, isnull,
! NewHeap->rd_rel->relhasoids, rwstate);
! }
! /*
! * In scan-and-sort mode, read all tuples from the tuplesort and write
! * them out into the new relation.
! */
! if (tuplesort != NULL)
! {
! tuplesort_performsort(tuplesort);
! for (;;)
! {
! HeapTuple tuple;
! bool shouldfree;
! CHECK_FOR_INTERRUPTS();
! tuple = tuplesort_getrawtuple(tuplesort, true, &shouldfree);
! if (tuple == NULL)
! break;
! deform_and_rewrite_tuple(tuple, oldTupDesc, newTupDesc,
! values, isnull,
! NewHeap->rd_rel->relhasoids, rwstate);
! if (shouldfree)
! heap_freetuple(tuple);
! }
}
! if (indexScan != NULL)
index_endscan(indexScan);
! if (heapScan != NULL)
heap_endscan(heapScan);
+ if (tuplesort != NULL)
+ tuplesort_end(tuplesort);
/* Write out any remaining tuples, and fsync if needed */
end_heap_rewrite(rwstate);
***************
*** 1486,1488 ****
--- 1515,1561 ----
return rvs;
}
+
+ static void deform_and_rewrite_tuple(HeapTuple tuple, TupleDesc oldTupDesc, TupleDesc newTupDesc,
+ Datum *values, bool *isnull,
+ bool newRelHasOids, RewriteState rwstate)
+ {
+ HeapTuple copiedTuple;
+ int i;
+
+ /*
+ * We cannot simply copy the tuple as-is, for several reasons:
+ *
+ * 1. We'd like to squeeze out the values of any dropped columns, both
+ * to save space and to ensure we have no corner-case failures. (It's
+ * possible for example that the new table hasn't got a TOAST table
+ * and so is unable to store any large values of dropped cols.)
+ *
+ * 2. The tuple might not even be legal for the new table; this is
+ * currently only known to happen as an after-effect of ALTER TABLE
+ * SET WITHOUT OIDS.
+ *
+ * So, we must reconstruct the tuple from component Datums.
+ */
+
+ heap_deform_tuple(tuple, oldTupDesc, values, isnull);
+
+ /* Be sure to null out any dropped columns */
+ for (i = 0; i < newTupDesc->natts; i++)
+ {
+ if (newTupDesc->attrs[i]->attisdropped)
+ isnull[i] = true;
+ }
+
+ copiedTuple = heap_form_tuple(newTupDesc, values, isnull);
+
+ /* Preserve OID, if any */
+ if (newRelHasOids)
+ HeapTupleSetOid(copiedTuple, HeapTupleGetOid(tuple));
+
+ /* The heap rewrite module does the rest */
+ rewrite_heap_tuple(rwstate, tuple, copiedTuple);
+
+ heap_freetuple(copiedTuple);
+ }
+
Index: src/backend/utils/sort/tuplesort.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/sort/tuplesort.c,v
retrieving revision 1.95
diff -c -r1.95 tuplesort.c
*** src/backend/utils/sort/tuplesort.c 26 Feb 2010 02:01:15 -0000 1.95
--- src/backend/utils/sort/tuplesort.c 21 Jul 2010 14:04:16 -0000
***************
*** 104,109 ****
--- 104,110 ----
#include "access/nbtree.h"
#include "catalog/pg_amop.h"
#include "catalog/pg_operator.h"
+ #include "catalog/index.h"
#include "commands/tablespace.h"
#include "miscadmin.h"
#include "pg_trace.h"
***************
*** 115,126 ****
--- 116,129 ----
#include "utils/rel.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+ #include "executor/executor.h"
/* sort-type codes for sort__start probes */
#define HEAP_SORT 0
#define INDEX_SORT 1
#define DATUM_SORT 2
+ #define RAWHEAP_SORT 3
/* GUC variables */
#ifdef TRACE_SORT
***************
*** 366,371 ****
--- 369,378 ----
int datumTypeLen;
bool datumTypeByVal;
+ /* These are specific to the rawheap subcase: */
+ EState *estate;
+ IndexInfo *indexInfo;
+
/*
* Resource snapshot for time of sort start.
*/
***************
*** 450,455 ****
--- 457,468 ----
static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void reversedirection_heap(Tuplesortstate *state);
+ static int comparetup_rawheap(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state);
+ static void copytup_rawheap(Tuplesortstate *state, SortTuple *stup, void *tup);
+ static void writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup);
+ static void readtup_rawheap(Tuplesortstate *state, SortTuple *stup,
+ int tapenum, unsigned int len);
static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
***************
*** 549,554 ****
--- 562,570 ----
state->result_tape = -1; /* flag that result tape has not been formed */
+ /* set estate to NULL, so we don't try to free it later if not used */
+ state->estate = NULL;
+
MemoryContextSwitchTo(oldcontext);
return state;
***************
*** 762,767 ****
--- 778,843 ----
return state;
}
+ Tuplesortstate *
+ tuplesort_begin_rawheap(Relation indexRel, TupleDesc tupDesc,
+ int workMem, bool randomAccess)
+ {
+ Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+ MemoryContext oldcontext;
+
+ TupleTableSlot *existing_slot;
+ ExprContext *econtext;
+
+ Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
+
+ oldcontext = MemoryContextSwitchTo(state->sortcontext);
+
+ #ifdef TRACE_SORT
+ if (trace_sort)
+ elog(LOG,
+ "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c",
+ RelationGetNumberOfAttributes(indexRel), workMem, randomAccess ? 't' : 'f');
+ #endif
+
+ state->nKeys = RelationGetNumberOfAttributes(indexRel);
+
+ TRACE_POSTGRESQL_SORT_START(RAWHEAP_SORT, false, state->nKeys, workMem, randomAccess);
+
+ state->comparetup = comparetup_rawheap;
+ state->copytup = copytup_rawheap;
+ state->writetup = writetup_rawheap;
+ state->readtup = readtup_rawheap;
+ state->reversedirection = reversedirection_heap;
+
+ state->indexInfo = BuildIndexInfo(indexRel);
+ state->indexScanKey = _bt_mkscankey_nodata(indexRel);
+ state->enforceUnique = false;
+
+ state->tupDesc = tupDesc; /* assume we need not copy tupDesc */
+
+ if (state->indexInfo->ii_Expressions != NULL)
+ {
+ /* allocate the vars used by FormIndexDatum */
+ state->estate = CreateExecutorState();
+
+ /*
+ * Need a TupleTableSlot to put existing tuples in.
+ *
+ * To use FormIndexDatum, we have to make the econtext's scantuple point
+ * to this slot. Be sure to save and restore caller's value for
+ * scantuple.
+ */
+ existing_slot = MakeSingleTupleTableSlot(tupDesc);
+
+ econtext = GetPerTupleExprContext(state->estate);
+ econtext->ecxt_scantuple = existing_slot;
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return state;
+ }
+
/*
* tuplesort_set_bound
*
***************
*** 849,854 ****
--- 925,933 ----
*/
TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, 0L);
#endif
+ if (state->estate != NULL)
+ ExecDropSingleTupleTableSlot(GetPerTupleExprContext(state->estate)->ecxt_scantuple);
+
MemoryContextSwitchTo(oldcontext);
***************
*** 980,985 ****
--- 1059,1086 ----
}
/*
+ * Accept one tuple while collecting input data for sort.
+ *
+ * Note that the input data is always copied; the caller need not save it.
+ */
+ void
+ tuplesort_putrawtuple(Tuplesortstate *state, HeapTuple tup)
+ {
+ MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ SortTuple stup;
+
+ /*
+ * Copy the given tuple into memory we control, and decrease availMem.
+ * Then call the common code.
+ */
+ COPYTUP(state, &stup, (void *) tup);
+
+ puttuple_common(state, &stup);
+
+ MemoryContextSwitchTo(oldcontext);
+ }
+
+ /*
* Shared code for tuple and datum cases.
*/
static void
***************
*** 1482,1487 ****
--- 1583,1607 ----
}
/*
+ * Fetch the next tuple in either forward or back direction. Returns NULL if
+ * no more tuples. If *should_free is set, the caller must pfree the returned
+ * tuple when done with it.
+ */
+ HeapTuple
+ tuplesort_getrawtuple(Tuplesortstate *state, bool forward, bool *should_free)
+ {
+ MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ SortTuple stup;
+
+ if (!tuplesort_gettuple_common(state, forward, &stup, should_free))
+ stup.tuple = NULL;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return stup.tuple;
+ }
+
+ /*
* tuplesort_merge_order - report merge order we'll use for given memory
* (note: "merge order" just means the number of input tapes in the merge).
*
***************
*** 3079,3084 ****
--- 3199,3414 ----
}
/*
+ * Routines specialized for the raw on-disk HeapTuple case. These are only used
+ * when we need the visibility info for things like CLUSTER. Otherwise we use
+ * the regular *tup_heap methods, which actually manipulate MinimalTuples.
+ */
+ static int
+ comparetup_rawheap(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
+ {
+ ScanKey scanKey = state->indexScanKey;
+ HeapTuple ltup;
+ HeapTuple rtup;
+ TupleDesc tupDesc;
+ int nkey;
+ int32 compare;
+
+ /* Allow interrupting long sorts */
+ CHECK_FOR_INTERRUPTS();
+
+ /* Compare the leading sort key */
+ compare = inlineApplySortFunction(&scanKey->sk_func,
+ scanKey->sk_flags,
+ a->datum1, a->isnull1,
+ b->datum1, b->isnull1);
+ if (compare != 0)
+ return compare;
+
+ /* Compare additional sort keys */
+ ltup = (HeapTuple) a->tuple;
+ rtup = (HeapTuple) b->tuple;
+
+ if (state->indexInfo->ii_Expressions == NULL)
+ {
+ /* if not expression index, just get the proper heap_getattr */
+
+ tupDesc = state->tupDesc;
+ scanKey++;
+
+ for (nkey = 1; nkey < state->nKeys; nkey++, scanKey++)
+ {
+ Datum datum1,
+ datum2;
+ bool isnull1,
+ isnull2;
+
+ datum1 = heap_getattr(ltup, state->indexInfo->ii_KeyAttrNumbers[nkey], tupDesc, &isnull1);
+ datum2 = heap_getattr(rtup, state->indexInfo->ii_KeyAttrNumbers[nkey], tupDesc, &isnull2);
+
+ compare = inlineApplySortFunction(&scanKey->sk_func,
+ scanKey->sk_flags,
+ datum1, isnull1,
+ datum2, isnull2);
+ if (compare != 0)
+ return compare;
+ }
+ }
+ else
+ {
+ /*
+ * in the expression index case, we get all the values/nulls:
+ * it would be faster to get only the required ones, but it would mean
+ * copy&paste from FormIndexDatum
+ */
+
+ Datum l_existing_values[INDEX_MAX_KEYS];
+ bool l_existing_isnull[INDEX_MAX_KEYS];
+ Datum r_existing_values[INDEX_MAX_KEYS];
+ bool r_existing_isnull[INDEX_MAX_KEYS];
+ TupleTableSlot *ecxt_scantuple;
+
+ ecxt_scantuple = GetPerTupleExprContext(state->estate)->ecxt_scantuple;
+
+ ExecStoreTuple(ltup, ecxt_scantuple, InvalidBuffer, false);
+ FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
+ l_existing_values, l_existing_isnull);
+
+ ExecStoreTuple(rtup, ecxt_scantuple, InvalidBuffer, false);
+ FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
+ r_existing_values, r_existing_isnull);
+
+ for (nkey = 1; nkey < state->nKeys; nkey++, scanKey++)
+ {
+ compare = inlineApplySortFunction(&scanKey->sk_func,
+ scanKey->sk_flags,
+ l_existing_values[nkey],
+ l_existing_isnull[nkey],
+ r_existing_values[nkey],
+ r_existing_isnull[nkey]);
+ if (compare != 0)
+ return compare;
+ }
+ }
+
+ return 0;
+ }
+
+ static void
+ copytup_rawheap(Tuplesortstate *state, SortTuple *stup, void *tup)
+ {
+ HeapTuple tuple = (HeapTuple) tup;
+
+ /* copy the tuple into sort storage */
+ stup->tuple = (void *) heap_copytuple(tuple);
+ USEMEM(state, GetMemoryChunkSpace(stup->tuple));
+ /* set up first-column key value */
+ if (state->indexInfo->ii_Expressions == NULL)
+ {
+ /* no expression index, just get the key datum value */
+ stup->datum1 = heap_getattr((HeapTuple) stup->tuple,
+ state->indexInfo->ii_KeyAttrNumbers[0],
+ state->tupDesc,
+ &stup->isnull1);
+ }
+ else
+ {
+ /*
+ * Extract the index column values and isnull flags from the existing
+ * tuple; we're interested only in the very first one, but to avoid
+ * copy&paste from FormIndexDatum we get all of them (even if it's
+ * slower)
+ */
+
+ Datum existing_values[INDEX_MAX_KEYS];
+ bool existing_isnull[INDEX_MAX_KEYS];
+ TupleTableSlot *ecxt_scantuple;
+
+ ecxt_scantuple = GetPerTupleExprContext(state->estate)->ecxt_scantuple;
+ ExecStoreTuple(tuple, ecxt_scantuple, InvalidBuffer, false);
+ FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
+ existing_values, existing_isnull);
+
+ stup->datum1 = existing_values[0];
+ stup->isnull1 = existing_isnull[0];
+ }
+ }
+
+ static void
+ writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup)
+ {
+ HeapTuple tuple = (HeapTuple) stup->tuple;
+ int tuplen = tuple->t_len + HEAPTUPLESIZE;
+
+ LogicalTapeWrite(state->tapeset, tapenum,
+ &tuplen, sizeof(tuplen));
+ LogicalTapeWrite(state->tapeset, tapenum,
+ (char *) tuple + sizeof(tuplen),
+ HEAPTUPLESIZE - sizeof(tuplen));
+ LogicalTapeWrite(state->tapeset, tapenum, tuple->t_data, tuple->t_len);
+ if (state->randomAccess) /* need trailing length word? */
+ LogicalTapeWrite(state->tapeset, tapenum, &tuplen, sizeof(tuplen));
+
+ FREEMEM(state, GetMemoryChunkSpace(tuple));
+ heap_freetuple(tuple);
+ }
+
+ static void
+ readtup_rawheap(Tuplesortstate *state, SortTuple *stup,
+ int tapenum, unsigned int tuplen)
+ {
+ HeapTuple tuple = (HeapTuple) palloc(tuplen);
+
+ USEMEM(state, GetMemoryChunkSpace(tuple));
+
+ tuple->t_len = tuplen - HEAPTUPLESIZE;
+ if (LogicalTapeRead(state->tapeset, tapenum,
+ (char *) tuple + sizeof(tuplen),
+ HEAPTUPLESIZE - sizeof(tuplen)) != HEAPTUPLESIZE - sizeof(tuplen))
+ elog(ERROR, "unexpected end of data");
+ tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
+ if (LogicalTapeRead(state->tapeset, tapenum,
+ tuple->t_data, tuple->t_len) != tuple->t_len)
+ elog(ERROR, "unexpected end of data");
+ if (state->randomAccess) /* need trailing length word? */
+ if (LogicalTapeRead(state->tapeset, tapenum, &tuplen,
+ sizeof(tuplen)) != sizeof(tuplen))
+ elog(ERROR, "unexpected end of data");
+
+ stup->tuple = tuple;
+
+ /* set up first-column key value */
+ if (state->indexInfo->ii_Expressions == NULL)
+ {
+ /* no expression index, just get the key datum value */
+ stup->datum1 = heap_getattr((HeapTuple) stup->tuple,
+ state->indexInfo->ii_KeyAttrNumbers[0],
+ state->tupDesc,
+ &stup->isnull1);
+ }
+ else
+ {
+ /*
+ * Extract the index column values and isnull flags from the existing
+ * tuple; we're interested only in the very first one, but to avoid
+ * copy&paste from FormIndexDatum we get all of them (even if it's
+ * slower)
+ */
+
+ Datum existing_values[INDEX_MAX_KEYS];
+ bool existing_isnull[INDEX_MAX_KEYS];
+ TupleTableSlot *ecxt_scantuple;
+
+ ecxt_scantuple = GetPerTupleExprContext(state->estate)->ecxt_scantuple;
+ ExecStoreTuple(tuple, ecxt_scantuple, InvalidBuffer, false);
+ FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
+ existing_values, existing_isnull);
+
+ stup->datum1 = existing_values[0];
+ stup->isnull1 = existing_isnull[0];
+ }
+ }
+
+ /*
* Convenience routine to free a tuple previously loaded into sort memory
*/
static void
Sorry for the very delayed review.
On Wed, Jul 21, 2010 at 11:15 PM, Leonardo Francalanci <m_lists@yahoo.it> wrote:
2) What other areas should I comment on more?
I think the patch is almost ready to commit, but I still
have some comments on usability and documentation.
I hope native English speakers can help improve the docs.
* The documentation could be simplified a bit, e.g.,
"CLUSTER requires twice the disk space of your original table".
The added description seems too difficult for ordinary users.
* How can users determine which method was used?
We can get some information from trace_sort logs,
but it would be better for CLUSTER VERBOSE to log
which CLUSTER method was used.
* How can users control which method will be used?
It might be good to explain that the method can be controlled
with enable_seqscan/enable_indexscan.
--
Itagaki Takahiro
On Tue, Aug 31, 2010 at 8:04 PM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:
* How can users determine which method was used?
We can get some information from trace_sort logs,
but it would be better for CLUSTER VERBOSE to log
which CLUSTER method was used.
I wrote a patch to improve CLUSTER VERBOSE (and VACUUM FULL VERBOSE).
The patch should be applied on top of sorted_cluster-20100721.patch.
* clustering "schema.table" by index scan on "index"
* clustering "schema.table" by sequential scan and sort
It also adds VACUUM VERBOSE-like logs:
INFO: "table": found 1 removable, 100001 nonremovable row versions in
1655 pages
DETAIL: 1 dead row versions cannot be removed yet.
CPU 0.03s/0.06u sec elapsed 0.21 sec.
Note that the patch says nothing about reindexing, but if
required, I'm willing to add some VERBOSE messages for
indexes (e.g., REINDEX VERBOSE).
--
Itagaki Takahiro
Attachments:
cluster_verbose-20100901.patch (application/octet-stream)
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 99235da..e110a66 100644
*** a/src/backend/access/heap/rewriteheap.c
--- b/src/backend/access/heap/rewriteheap.c
*************** rewrite_heap_tuple(RewriteState state,
*** 501,509 ****
/*
* Register a dead tuple with an ongoing rewrite. Dead tuples are not
* copied to the new table, but we still make note of them so that we
! * can release some resources earlier.
*/
! void
rewrite_heap_dead_tuple(RewriteState state, HeapTuple old_tuple)
{
/*
--- 501,509 ----
/*
* Register a dead tuple with an ongoing rewrite. Dead tuples are not
* copied to the new table, but we still make note of them so that we
! * can release some resources earlier. Returns true if actually registered.
*/
! bool
rewrite_heap_dead_tuple(RewriteState state, HeapTuple old_tuple)
{
/*
*************** rewrite_heap_dead_tuple(RewriteState sta
*** 539,545 ****
--- 539,549 ----
hash_search(state->rs_unresolved_tups, &hashkey,
HASH_REMOVE, &found);
Assert(found);
+
+ return true;
}
+
+ return false;
}
/*
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 383d7cf..2041a85 100644
*** a/src/backend/commands/cluster.c
--- b/src/backend/commands/cluster.c
***************
*** 45,50 ****
--- 45,51 ----
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
+ #include "utils/pg_rusage.h"
#include "utils/relcache.h"
#include "utils/relmapper.h"
#include "utils/snapmgr.h"
*************** typedef struct
*** 66,75 ****
static void rebuild_relation(Relation OldHeap, Oid indexOid,
! int freeze_min_age, int freeze_table_age);
static void copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
int freeze_min_age, int freeze_table_age,
! bool *pSwapToastByContent, TransactionId *pFreezeXid);
static List *get_tables_to_cluster(MemoryContext cluster_context);
static void deform_and_rewrite_tuple(HeapTuple tuple, TupleDesc oldTupDesc,
TupleDesc newTupDesc, Datum *values, bool *isnull,
--- 67,77 ----
static void rebuild_relation(Relation OldHeap, Oid indexOid,
! int freeze_min_age, int freeze_table_age, bool verbose);
static void copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
int freeze_min_age, int freeze_table_age,
! bool *pSwapToastByContent, TransactionId *pFreezeXid,
! bool verbose);
static List *get_tables_to_cluster(MemoryContext cluster_context);
static void deform_and_rewrite_tuple(HeapTuple tuple, TupleDesc oldTupDesc,
TupleDesc newTupDesc, Datum *values, bool *isnull,
*************** cluster_rel(Oid tableOid, Oid indexOid,
*** 383,402 ****
if (OidIsValid(indexOid))
check_index_is_clusterable(OldHeap, indexOid, recheck, AccessExclusiveLock);
- /* Log what we're doing (this could use more effort) */
- if (OidIsValid(indexOid))
- ereport(verbose ? INFO : DEBUG2,
- (errmsg("clustering \"%s.%s\"",
- get_namespace_name(RelationGetNamespace(OldHeap)),
- RelationGetRelationName(OldHeap))));
- else
- ereport(verbose ? INFO : DEBUG2,
- (errmsg("vacuuming \"%s.%s\"",
- get_namespace_name(RelationGetNamespace(OldHeap)),
- RelationGetRelationName(OldHeap))));
-
/* rebuild_relation does all the dirty work */
! rebuild_relation(OldHeap, indexOid, freeze_min_age, freeze_table_age);
/* NB: rebuild_relation does heap_close() on OldHeap */
}
--- 385,392 ----
if (OidIsValid(indexOid))
check_index_is_clusterable(OldHeap, indexOid, recheck, AccessExclusiveLock);
/* rebuild_relation does all the dirty work */
! rebuild_relation(OldHeap, indexOid, freeze_min_age, freeze_table_age, verbose);
/* NB: rebuild_relation does heap_close() on OldHeap */
}
*************** mark_index_clustered(Relation rel, Oid i
*** 580,586 ****
*/
static void
rebuild_relation(Relation OldHeap, Oid indexOid,
! int freeze_min_age, int freeze_table_age)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid tableSpace = OldHeap->rd_rel->reltablespace;
--- 570,576 ----
*/
static void
rebuild_relation(Relation OldHeap, Oid indexOid,
! int freeze_min_age, int freeze_table_age, bool verbose)
{
Oid tableOid = RelationGetRelid(OldHeap);
Oid tableSpace = OldHeap->rd_rel->reltablespace;
*************** rebuild_relation(Relation OldHeap, Oid i
*** 605,611 ****
/* Copy the heap data into the new table in the desired order */
copy_heap_data(OIDNewHeap, tableOid, indexOid,
freeze_min_age, freeze_table_age,
! &swap_toast_by_content, &frozenXid);
/*
* Swap the physical files of the target and transient tables, then
--- 595,601 ----
/* Copy the heap data into the new table in the desired order */
copy_heap_data(OIDNewHeap, tableOid, indexOid,
freeze_min_age, freeze_table_age,
! &swap_toast_by_content, &frozenXid, verbose);
/*
* Swap the physical files of the target and transient tables, then
*************** make_new_heap(Oid OIDOldHeap, Oid NewTab
*** 747,753 ****
static void
copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
int freeze_min_age, int freeze_table_age,
! bool *pSwapToastByContent, TransactionId *pFreezeXid)
{
Relation NewHeap,
OldHeap,
--- 737,744 ----
static void
copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
int freeze_min_age, int freeze_table_age,
! bool *pSwapToastByContent, TransactionId *pFreezeXid,
! bool verbose)
{
Relation NewHeap,
OldHeap,
*************** copy_heap_data(Oid OIDNewHeap, Oid OIDOl
*** 766,771 ****
--- 757,769 ----
RewriteState rwstate;
bool use_sort;
Tuplesortstate *tuplesort;
+ PGRUsage ru0;
+ double num_tuples = 0,
+ tups_vacuumed = 0,
+ nkeep = 0;
+ int elevel = verbose ? INFO : DEBUG2;
+
+ pg_rusage_init(&ru0);
/*
* Open the relations we need.
*************** copy_heap_data(Oid OIDNewHeap, Oid OIDOl
*** 891,896 ****
--- 889,912 ----
tuplesort = NULL;
}
+ /* Log what we're doing */
+ if (indexScan != NULL)
+ ereport(elevel,
+ (errmsg("clustering \"%s.%s\" by index scan on \"%s\"",
+ get_namespace_name(RelationGetNamespace(OldHeap)),
+ RelationGetRelationName(OldHeap),
+ RelationGetRelationName(OldIndex))));
+ else if (tuplesort != NULL)
+ ereport(elevel,
+ (errmsg("clustering \"%s.%s\" by sequential scan and sort",
+ get_namespace_name(RelationGetNamespace(OldHeap)),
+ RelationGetRelationName(OldHeap))));
+ else
+ ereport(elevel,
+ (errmsg("vacuuming \"%s.%s\"",
+ get_namespace_name(RelationGetNamespace(OldHeap)),
+ RelationGetRelationName(OldHeap))));
+
for (;;)
{
HeapTuple tuple;
*************** copy_heap_data(Oid OIDNewHeap, Oid OIDOl
*** 928,935 ****
/* Definitely dead */
isdead = true;
break;
- case HEAPTUPLE_LIVE:
case HEAPTUPLE_RECENTLY_DEAD:
/* Live or recently dead, must copy it */
isdead = false;
break;
--- 944,953 ----
/* Definitely dead */
isdead = true;
break;
case HEAPTUPLE_RECENTLY_DEAD:
+ nkeep += 1;
+ /* fall through */
+ case HEAPTUPLE_LIVE:
/* Live or recently dead, must copy it */
isdead = false;
break;
*************** copy_heap_data(Oid OIDNewHeap, Oid OIDOl
*** 973,983 ****
if (isdead)
{
/* heap rewrite module still needs to see it... */
! rewrite_heap_dead_tuple(rwstate, tuple);
continue;
}
if (tuplesort != NULL)
tuplesort_putrawtuple(tuplesort, tuple);
else
--- 991,1004 ----
if (isdead)
{
+ tups_vacuumed += 1;
/* heap rewrite module still needs to see it... */
! if (rewrite_heap_dead_tuple(rwstate, tuple))
! nkeep += 1;
continue;
}
+ num_tuples += 1;
if (tuplesort != NULL)
tuplesort_putrawtuple(tuplesort, tuple);
else
*************** copy_heap_data(Oid OIDNewHeap, Oid OIDOl
*** 1030,1035 ****
--- 1051,1065 ----
pfree(values);
pfree(isnull);
+ ereport(elevel,
+ (errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u pages",
+ RelationGetRelationName(OldHeap),
+ tups_vacuumed, num_tuples, RelationGetNumberOfBlocks(OldHeap)),
+ errdetail("%.0f dead row versions cannot be removed yet.\n"
+ "%s.",
+ nkeep,
+ pg_rusage_show(&ru0))));
+
if (OldIndex != NULL)
index_close(OldIndex, NoLock);
heap_close(OldHeap, NoLock);
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8ede6be..53a9ded 100644
*** a/src/include/access/rewriteheap.h
--- b/src/include/access/rewriteheap.h
*************** extern RewriteState begin_heap_rewrite(R
*** 25,30 ****
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
HeapTuple newTuple);
! extern void rewrite_heap_dead_tuple(RewriteState state, HeapTuple oldTuple);
#endif /* REWRITE_HEAP_H */
--- 25,30 ----
extern void end_heap_rewrite(RewriteState state);
extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
HeapTuple newTuple);
! extern bool rewrite_heap_dead_tuple(RewriteState state, HeapTuple oldTuple);
#endif /* REWRITE_HEAP_H */
(Sorry for the broken threading. I didn't have a convenient copy of the
original message to reply to.)
I looked at the patch and it seems quite reasonable, but two hunks of
the changes to src/backend/commands/cluster.c don't apply cleanly. I'm
not sure what version the patch was generated against, but the code in
copy_heap_data() seems to have changed quite a bit. I don't think it
would be too much trouble to adapt the changes, though.
-- ams
On Tue, Aug 31, 2010 at 8:04 PM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:
I think the patch is almost ready to commit, but I still
have some comments on usability and documentation.
I hope native English speakers can help improve the docs.
I'm checking the latest patch before applying it.
I found that we actually use maintenance_work_mem for the sort in the
seqscan+sort case, but the cost was estimated based on work_mem in the
patch. I added an internal cost_sort_with_mem() to costsize.c.
* The documentation could be simplified a bit, e.g.,
"CLUSTER requires twice the disk space of your original table".
The added description seems too difficult for ordinary users.
I reordered some of the description in the doc. Does it look better?
Comments and suggestions welcome.
--
Itagaki Takahiro
Attachments:
sorted_cluster-20100928.patch (application/octet-stream)
diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 4b64195..81c8f9f 100644
*** a/doc/src/sgml/ref/cluster.sgml
--- b/doc/src/sgml/ref/cluster.sgml
*************** CLUSTER [VERBOSE]
*** 128,138 ****
</para>
<para>
! During the cluster operation, a temporary copy of the table is created
! that contains the table data in the index order. Temporary copies of
! each index on the table are created as well. Therefore, you need free
! space on disk at least equal to the sum of the table size and the index
! sizes.
</para>
<para>
--- 128,167 ----
</para>
<para>
! The planner can choose between two different methods to cluster the table:
! <itemizedlist spacing="compact">
! <listitem><para>index scan</para></listitem>
! <listitem><para>table scan followed by sort</para></listitem>
! </itemizedlist>
! </para>
!
! <para>
! In the first case, during the cluster operation, a temporary copy of the
! table is created that contains the table data in the index order.
! Temporary copies of each index on the table are created as well. Therefore,
! you need free space on disk at least equal to the sum of the table size and
! the index sizes.
! </para>
!
! <para>
! In the second case, a full table scan is followed by a sort operation.
! This method is faster than the first one when the table is highly fragmented.
! In this case you need roughly twice that amount of disk space: in addition
! to the free space needed by the previous case, this approach may also need
! a temporary disk sort file which can be as big as the original table.
! </para>
!
! <para>
! The planner tries to choose the faster of the two methods based on the information
! below. It is advisable to run <xref linkend="sql-analyze"> on the table
! before clustering it, and to set the resource consumption parameters to
! sensible values, especially <xref linkend="guc-maintenance-work-mem">.
! Otherwise, the planner might make poor choices of the clustering method to use.
! <itemizedlist spacing="compact">
! <listitem><para>statistics on the table</para></listitem>
! <listitem><para>resource consumption config (see <xref linkend="runtime-config-resource">)</para></listitem>
! <listitem><para>planner method configuration (see <xref linkend="runtime-config-query-enable">)</para></listitem>
! </itemizedlist>
</para>
<para>
*************** CLUSTER [VERBOSE]
*** 149,184 ****
Otherwise, the planner might make poor choices of query plans.
</para>
- <para>
- There is another way to cluster data. The
- <command>CLUSTER</command> command reorders the original table by
- scanning it using the index you specify. This can be slow
- on large tables because the rows are fetched from the table
- in index order, and if the table is disordered, the
- entries are on random pages, so there is one disk page
- retrieved for every row moved. (<productname>PostgreSQL</productname> has
- a cache, but the majority of a big table will not fit in the cache.)
- The other way to cluster a table is to use:
-
- <programlisting>
- CREATE TABLE <replaceable class="parameter">newtable</replaceable> AS
- SELECT * FROM <replaceable class="parameter">table</replaceable> ORDER BY <replaceable class="parameter">columnlist</replaceable>;
- </programlisting>
-
- which uses the <productname>PostgreSQL</productname> sorting code
- to produce the desired order;
- this is usually much faster than an index scan for disordered data.
- Then you drop the old table, use
- <command>ALTER TABLE ... RENAME</command>
- to rename <replaceable class="parameter">newtable</replaceable> to the
- old name, and recreate the table's indexes.
- The big disadvantage of this approach is that it does not preserve
- OIDs, constraints, foreign key relationships, granted privileges, and
- other ancillary properties of the table — all such items must be
- manually recreated. Another disadvantage is that this way requires a sort
- temporary file about the same size as the table itself, so peak disk usage
- is about three times the table size instead of twice the table size.
- </para>
</refsect1>
<refsect1>
--- 178,183 ----
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index a2a2bbf..8445e02 100644
*** a/src/backend/commands/cluster.c
--- b/src/backend/commands/cluster.c
***************
*** 36,41 ****
--- 36,42 ----
#include "commands/trigger.h"
#include "commands/vacuum.h"
#include "miscadmin.h"
+ #include "optimizer/cost.h"
#include "storage/bufmgr.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
***************
*** 49,54 ****
--- 50,56 ----
#include "utils/snapmgr.h"
#include "utils/syscache.h"
#include "utils/tqual.h"
+ #include "utils/tuplesort.h"
/*
*************** static void copy_heap_data(Oid OIDNewHea
*** 69,74 ****
--- 71,79 ----
int freeze_min_age, int freeze_table_age,
bool *pSwapToastByContent, TransactionId *pFreezeXid);
static List *get_tables_to_cluster(MemoryContext cluster_context);
+ static void deform_and_rewrite_tuple(HeapTuple tuple, TupleDesc oldTupDesc,
+ TupleDesc newTupDesc, Datum *values, bool *isnull,
+ bool newRelHasOids, RewriteState rwstate);
*************** copy_heap_data(Oid OIDNewHeap, Oid OIDOl
*** 759,764 ****
--- 764,771 ----
TransactionId OldestXmin;
TransactionId FreezeXid;
RewriteState rwstate;
+ bool use_sort;
+ Tuplesortstate *tuplesort;
/*
* Open the relations we need.
*************** copy_heap_data(Oid OIDNewHeap, Oid OIDOl
*** 850,878 ****
* tuples that still need to be copied, we scan with SnapshotAny and use
* HeapTupleSatisfiesVacuum for the visibility test.
*/
! if (OldIndex != NULL)
{
heapScan = NULL;
indexScan = index_beginscan(OldHeap, OldIndex,
SnapshotAny, 0, (ScanKey) NULL);
}
else
{
heapScan = heap_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
indexScan = NULL;
}
for (;;)
{
HeapTuple tuple;
- HeapTuple copiedTuple;
Buffer buf;
bool isdead;
- int i;
CHECK_FOR_INTERRUPTS();
! if (OldIndex != NULL)
{
tuple = index_getnext(indexScan, ForwardScanDirection);
if (tuple == NULL)
--- 857,906 ----
* tuples that still need to be copied, we scan with SnapshotAny and use
* HeapTupleSatisfiesVacuum for the visibility test.
*/
! if (OldIndex != NULL && OldIndex->rd_rel->relam == BTREE_AM_OID)
! {
! /*
! * Check to see if a scan-and-sort would be less expensive than
! * an index scan.
! */
! Cost indexScanCost,
! seqScanAndSortCost;
!
! cost_index_scan_vs_seqscansort(OIDOldHeap, OIDOldIndex,
! &indexScanCost, &seqScanAndSortCost,
! maintenance_work_mem);
! use_sort = seqScanAndSortCost < indexScanCost;
! }
! else
! use_sort = false;
!
! if (OldIndex != NULL && !use_sort)
{
heapScan = NULL;
indexScan = index_beginscan(OldHeap, OldIndex,
SnapshotAny, 0, (ScanKey) NULL);
+ tuplesort = NULL;
}
else
{
heapScan = heap_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
indexScan = NULL;
+ if (use_sort)
+ tuplesort = tuplesort_begin_rawheap(OldIndex, oldTupDesc,
+ maintenance_work_mem, false);
+ else
+ tuplesort = NULL;
}
for (;;)
{
HeapTuple tuple;
Buffer buf;
bool isdead;
CHECK_FOR_INTERRUPTS();
! if (indexScan != NULL)
{
tuple = index_getnext(indexScan, ForwardScanDirection);
if (tuple == NULL)
*************** copy_heap_data(Oid OIDNewHeap, Oid OIDOl
*** 951,995 ****
continue;
}
! /*
! * We cannot simply copy the tuple as-is, for several reasons:
! *
! * 1. We'd like to squeeze out the values of any dropped columns, both
! * to save space and to ensure we have no corner-case failures. (It's
! * possible for example that the new table hasn't got a TOAST table
! * and so is unable to store any large values of dropped cols.)
! *
! * 2. The tuple might not even be legal for the new table; this is
! * currently only known to happen as an after-effect of ALTER TABLE
! * SET WITHOUT OIDS.
! *
! * So, we must reconstruct the tuple from component Datums.
! */
! heap_deform_tuple(tuple, oldTupDesc, values, isnull);
! /* Be sure to null out any dropped columns */
! for (i = 0; i < natts; i++)
! {
! if (newTupDesc->attrs[i]->attisdropped)
! isnull[i] = true;
! }
! copiedTuple = heap_form_tuple(newTupDesc, values, isnull);
! /* Preserve OID, if any */
! if (NewHeap->rd_rel->relhasoids)
! HeapTupleSetOid(copiedTuple, HeapTupleGetOid(tuple));
! /* The heap rewrite module does the rest */
! rewrite_heap_tuple(rwstate, tuple, copiedTuple);
! heap_freetuple(copiedTuple);
}
! if (OldIndex != NULL)
index_endscan(indexScan);
! else
heap_endscan(heapScan);
/* Write out any remaining tuples, and fsync if needed */
end_heap_rewrite(rwstate);
--- 979,1025 ----
continue;
}
! if (tuplesort != NULL)
! tuplesort_putrawtuple(tuplesort, tuple);
! else
! deform_and_rewrite_tuple(tuple, oldTupDesc, newTupDesc,
! values, isnull,
! NewHeap->rd_rel->relhasoids, rwstate);
! }
! /*
! * In scan-and-sort mode, read all tuples from the tuplesort and write
! * them out into the new relation.
! */
! if (tuplesort != NULL)
! {
! tuplesort_performsort(tuplesort);
! for (;;)
! {
! HeapTuple tuple;
! bool shouldfree;
! CHECK_FOR_INTERRUPTS();
! tuple = tuplesort_getrawtuple(tuplesort, true, &shouldfree);
! if (tuple == NULL)
! break;
! deform_and_rewrite_tuple(tuple, oldTupDesc, newTupDesc,
! values, isnull,
! NewHeap->rd_rel->relhasoids, rwstate);
! if (shouldfree)
! heap_freetuple(tuple);
! }
}
! if (indexScan != NULL)
index_endscan(indexScan);
! if (heapScan != NULL)
heap_endscan(heapScan);
+ if (tuplesort != NULL)
+ tuplesort_end(tuplesort);
/* Write out any remaining tuples, and fsync if needed */
end_heap_rewrite(rwstate);
*************** get_tables_to_cluster(MemoryContext clus
*** 1488,1490 ****
--- 1518,1565 ----
return rvs;
}
+
+ static void
+ deform_and_rewrite_tuple(HeapTuple tuple,
+ TupleDesc oldTupDesc, TupleDesc newTupDesc,
+ Datum *values, bool *isnull,
+ bool newRelHasOids, RewriteState rwstate)
+ {
+ HeapTuple copiedTuple;
+ int i;
+
+ /*
+ * We cannot simply copy the tuple as-is, for several reasons:
+ *
+ * 1. We'd like to squeeze out the values of any dropped columns, both
+ * to save space and to ensure we have no corner-case failures. (It's
+ * possible for example that the new table hasn't got a TOAST table
+ * and so is unable to store any large values of dropped cols.)
+ *
+ * 2. The tuple might not even be legal for the new table; this is
+ * currently only known to happen as an after-effect of ALTER TABLE
+ * SET WITHOUT OIDS.
+ *
+ * So, we must reconstruct the tuple from component Datums.
+ */
+
+ heap_deform_tuple(tuple, oldTupDesc, values, isnull);
+
+ /* Be sure to null out any dropped columns */
+ for (i = 0; i < newTupDesc->natts; i++)
+ {
+ if (newTupDesc->attrs[i]->attisdropped)
+ isnull[i] = true;
+ }
+
+ copiedTuple = heap_form_tuple(newTupDesc, values, isnull);
+
+ /* Preserve OID, if any */
+ if (newRelHasOids)
+ HeapTupleSetOid(copiedTuple, HeapTupleGetOid(tuple));
+
+ /* The heap rewrite module does the rest */
+ rewrite_heap_tuple(rwstate, tuple, copiedTuple);
+
+ heap_freetuple(copiedTuple);
+ }
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 53aa62f..55ec4dc 100644
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
***************
*** 76,81 ****
--- 76,82 ----
#include "optimizer/cost.h"
#include "optimizer/pathnode.h"
#include "optimizer/placeholder.h"
+ #include "optimizer/plancat.h"
#include "optimizer/planmain.h"
#include "optimizer/restrictinfo.h"
#include "parser/parsetree.h"
*************** static double approx_tuple_count(Planner
*** 140,145 ****
--- 141,149 ----
static void set_rel_width(PlannerInfo *root, RelOptInfo *rel);
static double relation_byte_size(double tuples, int width);
static double page_size(double tuples, int width);
+ static void cost_sort_with_mem(Path *path, PlannerInfo *root,
+ List *pathkeys, Cost input_cost, double tuples, int width,
+ double limit_tuples, int sort_mem);
/*
*************** cost_sort(Path *path, PlannerInfo *root,
*** 1112,1123 ****
List *pathkeys, Cost input_cost, double tuples, int width,
double limit_tuples)
{
Cost startup_cost = input_cost;
Cost run_cost = 0;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
! long work_mem_bytes = work_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
--- 1116,1137 ----
List *pathkeys, Cost input_cost, double tuples, int width,
double limit_tuples)
{
+ cost_sort_with_mem(path, root, pathkeys, input_cost, tuples, width,
+ limit_tuples, work_mem);
+ }
+
+ static void
+ cost_sort_with_mem(Path *path, PlannerInfo *root,
+ List *pathkeys, Cost input_cost, double tuples, int width,
+ double limit_tuples, int sort_mem)
+ {
+
Cost startup_cost = input_cost;
Cost run_cost = 0;
double input_bytes = relation_byte_size(tuples, width);
double output_bytes;
double output_tuples;
! long work_mem_bytes = sort_mem * 1024L;
if (!enable_sort)
startup_cost += disable_cost;
*************** cost_qual_eval_walker(Node *node, cost_q
*** 2672,2677 ****
--- 2686,2764 ----
(void *) context);
}
+ /*
+ * cost_index_scan_vs_seqscansort
+ * Estimates and returns the cost of a full-index scan and a sort after
+ * a sequential scan.
+ */
+ void
+ cost_index_scan_vs_seqscansort(Oid tableOid,
+ Oid indexOid,
+ Cost *indexScanCost,
+ Cost *seqScanAndSortCost,
+ int sort_mem)
+ {
+ RelOptInfo *rel;
+ PlannerInfo *root;
+ Query *query;
+ PlannerGlobal *glob;
+ RangeTblEntry *rte;
+ ListCell *index;
+ IndexPath *indexScanPath;
+ Path *seqScanAndSortPath;
+
+ /* make a dummy planner */
+ glob = makeNode(PlannerGlobal);
+
+ query = makeNode(Query);
+ query->resultRelation = 0;
+
+ root = makeNode(PlannerInfo);
+ root->parse = query;
+ root->glob = glob;
+
+ rel = makeNode(RelOptInfo);
+ rel->reloptkind = RELOPT_BASEREL;
+ rel->relid = 1;
+ rel->rtekind = RTE_RELATION;
+ get_relation_info(root, tableOid, false, rel);
+ rel->rows = rel->tuples;
+
+ root->total_table_pages = rel->pages;
+
+ rte = makeNode(RangeTblEntry);
+ rte->rtekind = RTE_RELATION;
+ rte->relid = tableOid;
+
+ root->simple_rel_array_size = 2;
+ root->simple_rte_array = (RangeTblEntry **)
+ palloc0(root->simple_rel_array_size * sizeof(RangeTblEntry *));
+ root->simple_rte_array[1] = rte;
+
+ /* estimate the cost of seq scan + sort */
+ seqScanAndSortPath = create_seqscan_path(NULL, rel);
+ cost_sort_with_mem(seqScanAndSortPath, root, NULL,
+ seqScanAndSortPath->total_cost, rel->tuples,
+ rel->width, -1, sort_mem);
+
+ /* estimate the cost of index scan */
+ indexScanPath = NULL;
+ foreach(index, rel->indexlist)
+ {
+ IndexOptInfo *indexInfo = (IndexOptInfo*) lfirst(index);
+ if (indexInfo->indexoid == indexOid)
+ {
+ indexScanPath = create_index_path(root, indexInfo, NULL, NULL,
+ ForwardScanDirection, NULL);
+ break;
+ }
+ }
+ Assert(indexScanPath != NULL);
+
+ *indexScanCost = indexScanPath->path.total_cost;
+ *seqScanAndSortCost = seqScanAndSortPath->total_cost;
+ }
+
/*
* adjust_semi_join
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index cf0a583..4ed6ecc 100644
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
***************
*** 104,109 ****
--- 104,110 ----
#include "access/nbtree.h"
#include "catalog/pg_amop.h"
#include "catalog/pg_operator.h"
+ #include "catalog/index.h"
#include "commands/tablespace.h"
#include "miscadmin.h"
#include "pg_trace.h"
***************
*** 115,126 ****
--- 116,129 ----
#include "utils/rel.h"
#include "utils/syscache.h"
#include "utils/tuplesort.h"
+ #include "executor/executor.h"
/* sort-type codes for sort__start probes */
#define HEAP_SORT 0
#define INDEX_SORT 1
#define DATUM_SORT 2
+ #define RAWHEAP_SORT 3
/* GUC variables */
#ifdef TRACE_SORT
*************** struct Tuplesortstate
*** 366,371 ****
--- 369,378 ----
int datumTypeLen;
bool datumTypeByVal;
+ /* These are specific to the rawheap subcase: */
+ EState *estate;
+ IndexInfo *indexInfo;
+
/*
* Resource snapshot for time of sort start.
*/
*************** static void writetup_heap(Tuplesortstate
*** 450,455 ****
--- 457,470 ----
static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
int tapenum, unsigned int len);
static void reversedirection_heap(Tuplesortstate *state);
+ static int comparetup_rawheap(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state);
+ static void copytup_rawheap(Tuplesortstate *state, SortTuple *stup, void *tup);
+ static void writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup);
+ static void readtup_rawheap(Tuplesortstate *state, SortTuple *stup,
+ int tapenum, unsigned int len);
+ static void extract_keys_rawheap(Tuplesortstate *state, SortTuple *stup,
+ HeapTuple tuple);
static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
*************** tuplesort_begin_common(int workMem, bool
*** 549,554 ****
--- 564,572 ----
state->result_tape = -1; /* flag that result tape has not been formed */
+ /* set estate to NULL, so we don't try to free it later if not used */
+ state->estate = NULL;
+
MemoryContextSwitchTo(oldcontext);
return state;
*************** tuplesort_begin_datum(Oid datumType,
*** 762,767 ****
--- 780,845 ----
return state;
}
+ Tuplesortstate *
+ tuplesort_begin_rawheap(Relation indexRel, TupleDesc tupDesc,
+ int workMem, bool randomAccess)
+ {
+ Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+ MemoryContext oldcontext;
+
+ TupleTableSlot *existing_slot;
+ ExprContext *econtext;
+
+ Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
+
+ oldcontext = MemoryContextSwitchTo(state->sortcontext);
+
+ #ifdef TRACE_SORT
+ if (trace_sort)
+ elog(LOG,
+ "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c",
+ RelationGetNumberOfAttributes(indexRel), workMem, randomAccess ? 't' : 'f');
+ #endif
+
+ state->nKeys = RelationGetNumberOfAttributes(indexRel);
+
+ TRACE_POSTGRESQL_SORT_START(RAWHEAP_SORT, false, state->nKeys, workMem, randomAccess);
+
+ state->comparetup = comparetup_rawheap;
+ state->copytup = copytup_rawheap;
+ state->writetup = writetup_rawheap;
+ state->readtup = readtup_rawheap;
+ state->reversedirection = reversedirection_heap;
+
+ state->indexInfo = BuildIndexInfo(indexRel);
+ state->indexScanKey = _bt_mkscankey_nodata(indexRel);
+ state->enforceUnique = false;
+
+ state->tupDesc = tupDesc; /* assume we need not copy tupDesc */
+
+ if (state->indexInfo->ii_Expressions != NULL)
+ {
+ /* allocate the vars used by FormIndexDatum */
+ state->estate = CreateExecutorState();
+
+ /*
+ * Need a TupleTableSlot to put existing tuples in.
+ *
+ * To use FormIndexDatum, we have to make the econtext's scantuple point
+ * to this slot. Be sure to save and restore caller's value for
+ * scantuple.
+ */
+ existing_slot = MakeSingleTupleTableSlot(tupDesc);
+
+ econtext = GetPerTupleExprContext(state->estate);
+ econtext->ecxt_scantuple = existing_slot;
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return state;
+ }
+
/*
* tuplesort_set_bound
*
*************** tuplesort_end(Tuplesortstate *state)
*** 849,854 ****
--- 927,935 ----
*/
TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, 0L);
#endif
+ if (state->estate != NULL)
+ ExecDropSingleTupleTableSlot(GetPerTupleExprContext(state->estate)->ecxt_scantuple);
+
MemoryContextSwitchTo(oldcontext);
*************** tuplesort_putdatum(Tuplesortstate *state
*** 980,985 ****
--- 1061,1088 ----
}
/*
+ * Accept one tuple while collecting input data for sort.
+ *
+ * Note that the input data is always copied; the caller need not save it.
+ */
+ void
+ tuplesort_putrawtuple(Tuplesortstate *state, HeapTuple tup)
+ {
+ MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ SortTuple stup;
+
+ /*
+ * Copy the given tuple into memory we control, and decrease availMem.
+ * Then call the common code.
+ */
+ COPYTUP(state, &stup, (void *) tup);
+
+ puttuple_common(state, &stup);
+
+ MemoryContextSwitchTo(oldcontext);
+ }
+
+ /*
* Shared code for tuple and datum cases.
*/
static void
*************** tuplesort_getdatum(Tuplesortstate *state
*** 1482,1487 ****
--- 1585,1609 ----
}
/*
+ * Fetch the next tuple in either forward or back direction. Returns NULL if
+ * no more tuples. If *should_free is set, the caller must pfree the returned
+ * tuple when done with it.
+ */
+ HeapTuple
+ tuplesort_getrawtuple(Tuplesortstate *state, bool forward, bool *should_free)
+ {
+ MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
+ SortTuple stup;
+
+ if (!tuplesort_gettuple_common(state, forward, &stup, should_free))
+ stup.tuple = NULL;
+
+ MemoryContextSwitchTo(oldcontext);
+
+ return stup.tuple;
+ }
+
+ /*
* tuplesort_merge_order - report merge order we'll use for given memory
* (note: "merge order" just means the number of input tapes in the merge).
*
*************** reversedirection_datum(Tuplesortstate *s
*** 3079,3084 ****
--- 3201,3387 ----
}
/*
+ * Routines specialized for raw on-disk HeapTuples. These are only used when
+ * we need visibility information for cases like CLUSTER. Otherwise we should
+ * use other more effective methods that manipulate tuples as MinimalTuples.
+ */
+ static int
+ comparetup_rawheap(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
+ {
+ ScanKey scanKey = state->indexScanKey;
+ HeapTuple ltup;
+ HeapTuple rtup;
+ TupleDesc tupDesc;
+ int nkey;
+ int32 compare;
+
+ /* Allow interrupting long sorts */
+ CHECK_FOR_INTERRUPTS();
+
+ /* Compare the leading sort key */
+ compare = inlineApplySortFunction(&scanKey->sk_func,
+ scanKey->sk_flags,
+ a->datum1, a->isnull1,
+ b->datum1, b->isnull1);
+ if (compare != 0)
+ return compare;
+
+ /* Compare additional sort keys */
+ ltup = (HeapTuple) a->tuple;
+ rtup = (HeapTuple) b->tuple;
+
+ if (state->indexInfo->ii_Expressions == NULL)
+ {
+ /* if not expression index, just get the proper heap_getattr */
+
+ tupDesc = state->tupDesc;
+ scanKey++;
+
+ for (nkey = 1; nkey < state->nKeys; nkey++, scanKey++)
+ {
+ Datum datum1,
+ datum2;
+ bool isnull1,
+ isnull2;
+
+ datum1 = heap_getattr(ltup, state->indexInfo->ii_KeyAttrNumbers[nkey], tupDesc, &isnull1);
+ datum2 = heap_getattr(rtup, state->indexInfo->ii_KeyAttrNumbers[nkey], tupDesc, &isnull2);
+
+ compare = inlineApplySortFunction(&scanKey->sk_func,
+ scanKey->sk_flags,
+ datum1, isnull1,
+ datum2, isnull2);
+ if (compare != 0)
+ return compare;
+ }
+ }
+ else
+ {
+ /*
+ * in the expression index case, we get all the values/nulls:
+ * it would be faster to get only the required ones, but it would mean
+ * copy&paste from FormIndexDatum
+ */
+
+ Datum l_existing_values[INDEX_MAX_KEYS];
+ bool l_existing_isnull[INDEX_MAX_KEYS];
+ Datum r_existing_values[INDEX_MAX_KEYS];
+ bool r_existing_isnull[INDEX_MAX_KEYS];
+ TupleTableSlot *ecxt_scantuple;
+
+ ecxt_scantuple = GetPerTupleExprContext(state->estate)->ecxt_scantuple;
+
+ ExecStoreTuple(ltup, ecxt_scantuple, InvalidBuffer, false);
+ FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
+ l_existing_values, l_existing_isnull);
+
+ ExecStoreTuple(rtup, ecxt_scantuple, InvalidBuffer, false);
+ FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
+ r_existing_values, r_existing_isnull);
+
+ for (nkey = 1; nkey < state->nKeys; nkey++, scanKey++)
+ {
+ compare = inlineApplySortFunction(&scanKey->sk_func,
+ scanKey->sk_flags,
+ l_existing_values[nkey],
+ l_existing_isnull[nkey],
+ r_existing_values[nkey],
+ r_existing_isnull[nkey]);
+ if (compare != 0)
+ return compare;
+ }
+ }
+
+ return 0;
+ }
+
+ static void
+ copytup_rawheap(Tuplesortstate *state, SortTuple *stup, void *tup)
+ {
+ HeapTuple tuple = (HeapTuple) tup;
+
+ /* copy the tuple into sort storage */
+ stup->tuple = (void *) heap_copytuple(tuple);
+ USEMEM(state, GetMemoryChunkSpace(stup->tuple));
+
+ extract_keys_rawheap(state, stup, tuple);
+ }
+
+ static void
+ writetup_rawheap(Tuplesortstate *state, int tapenum, SortTuple *stup)
+ {
+ HeapTuple tuple = (HeapTuple) stup->tuple;
+ int tuplen = tuple->t_len + HEAPTUPLESIZE;
+
+ LogicalTapeWrite(state->tapeset, tapenum,
+ &tuplen, sizeof(tuplen));
+ LogicalTapeWrite(state->tapeset, tapenum,
+ (char *) tuple + sizeof(tuplen),
+ HEAPTUPLESIZE - sizeof(tuplen));
+ LogicalTapeWrite(state->tapeset, tapenum, tuple->t_data, tuple->t_len);
+ if (state->randomAccess) /* need trailing length word? */
+ LogicalTapeWrite(state->tapeset, tapenum, &tuplen, sizeof(tuplen));
+
+ FREEMEM(state, GetMemoryChunkSpace(tuple));
+ heap_freetuple(tuple);
+ }
+
+ static void
+ readtup_rawheap(Tuplesortstate *state, SortTuple *stup,
+ int tapenum, unsigned int tuplen)
+ {
+ HeapTuple tuple = (HeapTuple) palloc(tuplen);
+
+ USEMEM(state, GetMemoryChunkSpace(tuple));
+
+ tuple->t_len = tuplen - HEAPTUPLESIZE;
+ if (LogicalTapeRead(state->tapeset, tapenum,
+ (char *) tuple + sizeof(tuplen),
+ HEAPTUPLESIZE - sizeof(tuplen)) != HEAPTUPLESIZE - sizeof(tuplen))
+ elog(ERROR, "unexpected end of data");
+ tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
+ if (LogicalTapeRead(state->tapeset, tapenum,
+ tuple->t_data, tuple->t_len) != tuple->t_len)
+ elog(ERROR, "unexpected end of data");
+ if (state->randomAccess) /* need trailing length word? */
+ if (LogicalTapeRead(state->tapeset, tapenum, &tuplen,
+ sizeof(tuplen)) != sizeof(tuplen))
+ elog(ERROR, "unexpected end of data");
+
+ stup->tuple = tuple;
+ extract_keys_rawheap(state, stup, tuple);
+ }
+
+ static void
+ extract_keys_rawheap(Tuplesortstate *state, SortTuple *stup, HeapTuple tuple)
+ {
+ /* set up first-column key value */
+ if (state->indexInfo->ii_Expressions == NULL)
+ {
+ /* no expression index, just get the key datum value */
+ stup->datum1 = heap_getattr((HeapTuple) stup->tuple,
+ state->indexInfo->ii_KeyAttrNumbers[0],
+ state->tupDesc,
+ &stup->isnull1);
+ }
+ else
+ {
+ /* Extract the index column values and nulls from the existing tuple. */
+ Datum existing_values[INDEX_MAX_KEYS];
+ bool existing_isnull[INDEX_MAX_KEYS];
+ TupleTableSlot *ecxt_scantuple;
+
+ ecxt_scantuple = GetPerTupleExprContext(state->estate)->ecxt_scantuple;
+ ExecStoreTuple(tuple, ecxt_scantuple, InvalidBuffer, false);
+ FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
+ existing_values, existing_isnull);
+
+ stup->datum1 = existing_values[0];
+ stup->isnull1 = existing_isnull[0];
+ }
+ }
+
+ /*
* Convenience routine to free a tuple previously loaded into sort memory
*/
static void
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 63641b9..be60654 100644
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
*************** extern void cost_hashjoin(HashPath *path
*** 110,115 ****
--- 110,117 ----
extern void cost_subplan(PlannerInfo *root, SubPlan *subplan, Plan *plan);
extern void cost_qual_eval(QualCost *cost, List *quals, PlannerInfo *root);
extern void cost_qual_eval_node(QualCost *cost, Node *qual, PlannerInfo *root);
+ extern void cost_index_scan_vs_seqscansort(Oid tableOid, Oid indexOid,
+ Cost *indexScanCost, Cost *seqScanAndSortCost, int sort_mem);
extern void set_baserel_size_estimates(PlannerInfo *root, RelOptInfo *rel);
extern void set_joinrel_size_estimates(PlannerInfo *root, RelOptInfo *rel,
RelOptInfo *outer_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index d879ff0..b215bb5 100644
*** a/src/include/utils/tuplesort.h
--- b/src/include/utils/tuplesort.h
*************** extern Tuplesortstate *tuplesort_begin_i
*** 64,69 ****
--- 64,72 ----
extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
Oid sortOperator, bool nullsFirstFlag,
int workMem, bool randomAccess);
+ extern Tuplesortstate *tuplesort_begin_rawheap(Relation indexRel,
+ TupleDesc tupDesc,
+ int workMem, bool randomAccess);
extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
*************** extern void tuplesort_puttupleslot(Tuple
*** 72,77 ****
--- 75,81 ----
extern void tuplesort_putindextuple(Tuplesortstate *state, IndexTuple tuple);
extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
bool isNull);
+ extern void tuplesort_putrawtuple(Tuplesortstate *state, HeapTuple tup);
extern void tuplesort_performsort(Tuplesortstate *state);
*************** extern IndexTuple tuplesort_getindextupl
*** 81,86 ****
--- 85,92 ----
bool *should_free);
extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward,
Datum *val, bool *isNull);
+ extern HeapTuple tuplesort_getrawtuple(Tuplesortstate *state, bool forward,
+ bool *should_free);
extern void tuplesort_end(Tuplesortstate *state);
On Mon, Sep 27, 2010 at 10:05 PM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:
I re-ordered some description in the doc. Does it look better?
Comments and suggestions welcome.
I thought this paragraph was a little confusing:
! In the second case, a full table scan is followed by a sort operation.
! The method is faster than the first one when the table is highly
! fragmented.
! You need twice disk space of the sum in the case. In addition to the free
! space needed by the previous case, this approach may also need a temporary
! disk sort file which can be as big as the original table.
I think the worst-case disk space could be made a little more clear
here, and maybe some general wordsmithing as well. I wasn't sure what
"twice disk space of the sum" was in this description -- sum of what
(table and all indexes?). And does "twice disk space" include the
temporary disk sort file? Here's an idea of how I think this paragraph
could be cleaned up a bit, if my understanding of the disk space
required is about right:
! In the second case, a full table scan is followed by a sort operation.
! This method is faster than the first one when the table is highly
! fragmented. However, <command>CLUSTER</command> may require available
! disk space of up to twice the sum of the size of the table and its
! indexes, if it uses a temporary disk sort file, which can be as big as
! the original table.
Also, AIUI, this second clustering method is similar to the older
idiom of CREATE TABLE new AS SELECT * FROM old ORDER BY col; Since the
paragraph describing this older idiom is being removed, perhaps a
brief mention in the documentation could be made of this similarity.
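For reference, the older idiom being alluded to goes roughly like this
(a sketch; table, column, and index names are examples only):
BEGIN;
CREATE TABLE newtab AS SELECT * FROM oldtab ORDER BY col;
DROP TABLE oldtab;
ALTER TABLE newtab RENAME TO oldtab;
-- indexes, constraints, permissions, etc. must be recreated by hand:
CREATE INDEX oldtab_col_idx ON oldtab (col);
COMMIT;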
Some more wordsmithing: change
! The planner tries to choose a faster method in them base on the information
to:
! The planner tries to choose the fastest method based on the information
I started looking at the performance impact of this patch based on
Leonardo's SQL file. On the 2 million row table, I see a consistent
~10% advantage for the sequential scan clusters. I'm going to try to
run the bigger tests a few times and post results from there when I
get a chance.
Josh
Excerpts from Josh Kupershmidt's message of Tue Sep 28 23:53:33 -0400 2010:
I started looking at the performance impact of this patch based on
Leonardo's SQL file. On the 2 million row table, I see a consistent
~10% advantage for the sequential scan clusters. I'm going to try to
run the bigger tests a few times and post results from there when I
get a chance.
10% is nothing. I was expecting this patch would give an order of
magnitude of improvement or something like that in the worst cases of
the current code (highly unsorted input).
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, Sep 29, 2010 at 1:27 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
I see a consistent ~10% advantage for the sequential scan clusters.
10% is nothing. I was expecting this patch would give an order of
magnitude of improvement or something like that in the worst cases of
the current code (highly unsorted input).
Yes. It should be about 10x faster than the ordinary method in the worst
cases. I have a performance result for pg_reorg, which performs the same
operation as the patch. It was 16 times faster than the old CLUSTER. In
addition, it was slower when the table was not fragmented. (So the
advantage should not be "consistent".)
http://reorg.projects.postgresql.org/
--
Itagaki Takahiro
On Wed, Sep 29, 2010 at 12:53 PM, Josh Kupershmidt <schmiddy@gmail.com> wrote:
I thought this paragraph was a little confusing:
Thanks for checking.
! In the second case, a full table scan is followed by a sort operation.
! The method is faster than the first one when the table is highly
! fragmented.
! You need twice disk space of the sum in the case. In addition to the free
! space needed by the previous case, this approach may also need a temporary
! disk sort file which can be as big as the original table.
I think the worst-case disk space could be made a little more clear
here, and maybe some general wordsmithing as well. I wasn't sure what
"twice disk space of the sum" was in this description -- sum of what
(table and all indexes?).
To be exact, it's rather complex.
Reconstructing the table requires about twice the disk space of the old
table (for the sort tapes and the new table).
After sorting the table, CLUSTER performs REINDEX. That needs
{the size of the new table} + {twice the disk space of the new indexes}.
Also, the new relations will be the same size as the old relations if
they have no free space.
So I think "twice the disk space of the sum of the table and indexes"
would be the simplest explanation for a safe margin.
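As a rough worked example (the numbers are only illustrative): for a
1 GB table carrying 0.5 GB of indexes, the safe margin would be
2 * (1 GB + 0.5 GB) = 3 GB of free disk space. A sketch of how to
estimate it for a real table, given that pg_total_relation_size()
includes indexes and TOAST:
SELECT pg_size_pretty(2 * pg_total_relation_size('mytable'));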
Also, AIUI, this second clustering method is similar to the older
idiom of CREATE TABLE new AS SELECT * FROM old ORDER BY col; Since the
paragraph describing this older idiom is being removed, perhaps a
brief mention in the documentation could be made of this similarity.
Good idea.
Some more wordsmithing: change
! The planner tries to choose a faster method in them base on the information
to:
! The planner tries to choose the fastest method based on the information
Thanks.
--
Itagaki Takahiro
10% is nothing. I was expecting this patch would give an order of
magnitude of improvement or something like that in the worst cases of
the current code (highly unsorted input).
Yes. It should be about 10x faster than the ordinary method in the worst
cases.
Here's my post with a (very simple) performance test:
http://archives.postgresql.org/pgsql-hackers/2010-02/msg00766.php
The test I used wasn't a "worst case" scenario, since it is based on
random data, not wrong-ordered data. Obviously, the real difference
can be seen on large tables (5M+ rows), and/or slow disks.
Leonardo
Excerpts from Leonardo Francalanci's message of Wed Sep 29 03:17:07 -0400 2010:
Here's my post with a (very simple) performance test:
http://archives.postgresql.org/pgsql-hackers/2010-02/msg00766.php
I think the 10M rows test is more in line with what we want (83s vs. 646s).
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Excerpts from Itagaki Takahiro's message of Wed Sep 29 01:25:38 -0400 2010:
To be exact, it's rather complex.
Reconstructing the table requires about twice the disk space of the old
table (for the sort tapes and the new table).
After sorting the table, CLUSTER performs REINDEX. That needs
{the size of the new table} + {twice the disk space of the new indexes}.
Also, the new relations will be the same size as the old relations if
they have no free space.
Aren't the old table and indexes kept until transaction commit, though?
So during the reindex you still keep the old copy of the table and
indexes around, thus double space. The only space difference would be
extra free space in the old table, tuples older than OldestXmin, and
fillfactor changes.
So I think "twice the disk space of the sum of the table and indexes"
would be the simplest explanation for a safe margin.
Agreed.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, Sep 29, 2010 at 1:12 AM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:
On Wed, Sep 29, 2010 at 1:27 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
I see a consistent ~10% advantage for the sequential scan clusters.
10% is nothing. I was expecting this patch would give an order of
magnitude of improvement or something like that in the worst cases of
the current code (highly unsorted input).
Yes. It should be about 10x faster than the ordinary method in the worst
cases. I have a performance result for pg_reorg, which performs the same
operation as the patch. It was 16 times faster than the old CLUSTER. In
addition, it was slower when the table was not fragmented. (So the
advantage should not be "consistent".)
http://reorg.projects.postgresql.org/
Can you reproduce that with this patch?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Wed, Sep 29, 2010 at 10:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Can you reproduce that with this patch?
No, I can't use the machine anymore.
--
Itagaki Takahiro
Here's my post with a (very simple) performance test:
http://archives.postgresql.org/pgsql-hackers/2010-02/msg00766.php
I think the 10M rows test is more in line with what we want (83s vs. 646s).
Can someone else test the patch to see if what I found is still valid?
I don't think it makes much sense if I'm the only one that says
"this is faster" :)
On Wed, Sep 29, 2010 at 11:55 AM, Leonardo Francalanci <m_lists@yahoo.it> wrote:
Can someone else test the patch to see if what I found is still valid?
I don't think it makes much sense if I'm the only one that says
"this is faster" :)
I ran a few more performance tests on this patch. Here's what I got
for the tests Leonardo posted originally:
* 2M rows: 22 seconds for seq. scan, 24 seconds for index scan
* 5M rows: 139 seconds for seq. scan, 97 seconds for index scan
* 10M rows: 256 seconds for seq. scan, 611 seconds for index scan
(times are for the cluster operation only, not for the table
creations, etc. which took most of the time)
I tried a few more tests of creating a table with either 10M or 50M
rows, then deleting 90% of the rows and running a cluster. The patch
didn't fare so well here:
* 10M rows: 84 seconds for seq. scan, 44 seconds for index scan
The seq. scan results here were obtained with the patch applied, and
without using planner hints (enable_seqscan or enable_indexscan). I
added in an ereport() call to check that use_sort was actually true.
The index scan results were obtained without the patch applied. The
SQL file I used is attached.
So I think there are definitely cases where this patch helps, but it
looks like a seq. scan is being chosen in some cases where it doesn't
help.
Test machine: MacBook Pro laptop, C2D 2.53 GHz, 4GB RAM.
Settings: shared_buffers = 16MB, work_mem and maintenance_work_mem set
from the SQL scripts.
Josh
On Wed, 2010-09-29 at 08:44 -0400, Alvaro Herrera wrote:
So I think "twice the disk space of the sum of the table and indexes"
would be the simplest explanation for a safe margin.
Agreed.
Surely the peak space is x3?
Old space + sort space + new space.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services
Simon Riggs <simon@2ndQuadrant.com> writes:
On Wed, 2010-09-29 at 08:44 -0400, Alvaro Herrera wrote:
So I think "twice the disk space of the sum of the table and indexes"
would be the simplest explanation for a safe margin.
Agreed.
Surely the peak space is x3?
Old space + sort space + new space.
The wording should be something like "CLUSTER requires transient disk
space equal to about twice the size of the table plus its indexes".
regards, tom lane
Hi, Leonardo-san,
On Fri, Oct 1, 2010 at 4:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The wording should be something like "CLUSTER requires transient disk
space equal to about twice the size of the table plus its indexes".
Could you merge those discussions into the final patch?
Also, please check whether my modification broke your patch.
Thank you.
--
Itagaki Takahiro
On Sep 30, 2010, at 9:07 PM, Itagaki Takahiro <itagaki.takahiro@gmail.com> wrote:
Hi, Leonardo-san,
On Fri, Oct 1, 2010 at 4:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The wording should be something like "CLUSTER requires transient disk
space equal to about twice the size of the table plus its indexes".
Could you merge those discussions into the final patch?
Also, please check whether my modification broke your patch.
Thank you.
It sounds like the costing model might need a bit more work before we commit this.
...Robert
I ran a few more performance tests on this patch. Here's what I got
for the tests Leonardo posted originally:
* 2M rows: 22 seconds for seq. scan, 24 seconds for index scan
* 5M rows: 139 seconds for seq. scan, 97 seconds for index scan
* 10M rows: 256 seconds for seq. scan, 611 seconds for index scan
I don't have time right now to run more tests, I'll try to make some by
next week.
Would it mean that doing:
create table p as select * from atable order by akey
(where akey is randomly distributed)
with 5M rows is faster with enable_seqscan=0 and
enable_indexscan=1??? That would be weird, especially on a
laptop hard drive! (assuming there's a reasonable amount of
memory set in work_mem/maintenance_work_mem)
I tried a few more tests of creating a table with either 10M or 50M
rows, then deleting 90% of the rows and running a cluster. The patch
didn't fare so well here:
* 10M rows: 84 seconds for seq. scan, 44 seconds for index scan
[...]
So I think there are definitely cases where this patch helps, but it
looks like a seq. scan is being chosen in some cases where it doesn't
help.
Test machine: MacBook Pro laptop, C2D 2.53 GHz, 4GB RAM.
Again: what would the planner choose in that case for a:
create table p as select * from mybloat order by myid
???
I guess that if the planner makes a wrong choice in this case (that is,
seq scan + sort instead of index scan) there's no way for "cluster" to
behave in a different way. If, on the contrary, the "create table..." uses
the right plan, and cluster doesn't, we have a problem in the patch.
Am I right?
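One concrete way to run that check (a sketch, reusing the hypothetical
names above; the index name is made up):
-- which method does the planner pick for the equivalent ordering query?
EXPLAIN CREATE TABLE p AS SELECT * FROM mybloat ORDER BY myid;
-- compare against what the patched CLUSTER chooses for the same data:
CLUSTER mybloat USING mybloat_myid_idx;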
On Fri, Oct 1, 2010 at 4:33 AM, Leonardo Francalanci <m_lists@yahoo.it> wrote:
I guess that if the planner makes a wrong choice in this case (that is,
seq scan + sort instead of index scan) there's no way for "cluster" to
behave in a different way. If, on the contrary, the "create table..." uses
the right plan, and cluster doesn't, we have a problem in the patch.
Am I right?
Good point. I think you're right.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Fri, Oct 1, 2010 at 4:33 AM, Leonardo Francalanci <m_lists@yahoo.it> wrote:
I ran a few more performance tests on this patch. Here's what I got
for the tests Leonardo posted originally:
* 2M rows: 22 seconds for seq. scan, 24 seconds for index scan
* 5M rows: 139 seconds for seq. scan, 97 seconds for index scan
* 10M rows: 256 seconds for seq. scan, 611 seconds for index scan
I don't have time right now to run more tests, I'll try to make some by
next week.
Would it mean that doing:
create table p as select * from atable order by akey
(where akey is randomly distributed)
with 5M rows is faster with enable_seqscan=0 and
enable_indexscan=1??? That would be weird, especially on a
laptop hard drive! (assuming there's a reasonable amount of
memory set in work_mem/maintenance_work_mem)
Hrm, this is interesting. I set up a test table with 5M rows like so:
CREATE TABLE atable (
akey int
);
INSERT INTO atable (akey)
SELECT (RANDOM() * 100000)::int FROM generate_series(1,5000000);
CREATE INDEX akey_idx ON atable(akey);
ANALYZE atable;
And then I tested table creation times. First, using a normal:
BEGIN;
SET enable_seqscan = on;
SET enable_indexscan = on;
EXPLAIN ANALYZE CREATE TABLE idxscanned AS SELECT * FROM atable
ORDER BY akey;
ROLLBACK;
and I get:
Index Scan using akey_idx on atable
(cost=0.00..218347.89 rows=5000000 width=4)
(actual time=0.058..23612.020 rows=5000000 loops=1)
Total runtime: 33029.884 ms
Then, I tried forcing a sequential scan by changing "SET
enable_indexscan = off;", and it's significantly faster, as I would
expect:
Sort (cost=696823.42..709323.42 rows=5000000 width=4)
(actual time=8664.699..13533.131 rows=5000000 loops=1)
Sort Key: akey
Sort Method: external merge Disk: 68304kB
-> Seq Scan on atable (cost=0.00..72124.00 rows=5000000 width=4)
(actual time=0.012..838.092 rows=5000000 loops=1)
Total runtime: 21015.501 ms
I've run both of these several times, and get 30-32 seconds for the
index scan and 20-21 seconds for the seq. scan each time.
My seq_page_cost and random_page_cost were left at the defaults for
these tests. Oddly, I tried turning seq_page_cost down to 0.01 and
EXPLAIN ANALYZE told me that an index scan was still being chosen. Is
there maybe some other setting I'm forgetting?
Josh
On Sat, Oct 2, 2010 at 9:36 AM, Josh Kupershmidt <schmiddy@gmail.com> wrote:
Hrm, this is interesting. I set up a test table with 5M rows like so:
Such discussions are about the planner itself, right? The sorted cluster
patch uses the existing planner's costing model, so we can discuss the
clustering feature and planner improvements in separate patches.
My seq_page_cost and random_page_cost were left at the defaults for
these tests. Oddly, I tried turning seq_page_cost down to 0.01 and
EXPLAIN ANALYZE told me that an index scan was still being chosen. Is
there maybe some other setting I'm forgetting?
It might come from effective_cache_size. We consider that value only for
index scans. We could also use the effective cache, in addition to
work_mem, for the tapes used by disk sorting, but we don't take the
effective cache into account there for now.
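A quick way to test that hypothesis would be something like this sketch
(the value is only an example; a larger effective_cache_size makes index
scans look cheaper, so lowering it should push the planner toward the
seqscan + sort plan):
SHOW effective_cache_size;
SET effective_cache_size = '128MB';
EXPLAIN SELECT * FROM atable ORDER BY akey;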
--
Itagaki Takahiro
On Oct 1, 2010, at 8:36 PM, Josh Kupershmidt <schmiddy@gmail.com> wrote:
[...]
My seq_page_cost and random_page_cost were left at the defaults for
these tests. Oddly, I tried turning seq_page_cost down to 0.01 and
EXPLAIN ANALYZE told me that an index scan was still being chosen. Is
there maybe some other setting I'm forgetting?
Did you also adjust random_page_cost?
...Robert
It sounds like the costing model might need a bit more work before we commit
this.
I tried again the simple sql tests I posted a while ago, and I still get the
same ratios.
I've tested the applied patch on a dual Opteron + disk array Solaris machine.
I really don't get how a laptop hard drive can be faster at reading data
using random seeks (required by the original cluster method) than seq
scan + sort for the 5M rows test case.
Same thing for the "cluster vs bloat" test: the seq scan + sort is
faster on my machine.
I've just noticed that Josh used shared_buffers = 16MB for the "cluster
vs bloat" test: I'm using a much higher shared_buffers (I think
something like 200MB), since if you're working with tables this big I
thought it could be a more appropriate value. Maybe that's the thing
that makes the difference???
Can someone else test the patch?
And: I don't have that deep a knowledge of how PostgreSQL deletes rows,
but I thought that something like:
DELETE FROM mybloat WHERE RANDOM() < 0.9;
would only delete data, not index entries; so the patch should perform
even better in this case (as it does, in fact, on my test machine), as:
- the original cluster method would read the whole index, and fetch only
the still-alive rows
- the new method would read the table using a seq scan, and sort in
memory the few rows still alive
But, as I said, maybe I'm getting this part wrong...
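For anyone who wants to retry it, the scenario under discussion can be
set up along these lines (a sketch; sizes, settings, and names are
examples only):
SET maintenance_work_mem = '64MB';
CREATE TABLE mybloat AS
  SELECT (random() * 100000)::int AS myid
  FROM generate_series(1, 10000000);
CREATE INDEX mybloat_myid_idx ON mybloat (myid);
DELETE FROM mybloat WHERE random() < 0.9;  -- dead heap tuples; index entries remain
CLUSTER mybloat USING mybloat_myid_idx;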
On Mon, Oct 4, 2010 at 8:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Did you also adjust random_page_cost?
No, my seq_page_cost (1) and random_page_cost (4) are left at the
defaults. Here are all my non-default settings:
test=# SELECT name, unit, setting FROM pg_settings WHERE source != 'default';
            name            | unit |                 setting
----------------------------+------+------------------------------------------
application_name | | psql
config_file | | /Users/josh/runtime/data/postgresql.conf
data_directory | | /Users/josh/runtime/data
DateStyle | | ISO, MDY
default_text_search_config | | pg_catalog.english
effective_cache_size | 8kB | 131072
hba_file | | /Users/josh/runtime/data/pg_hba.conf
ident_file | | /Users/josh/runtime/data/pg_ident.conf
lc_collate | | en_US.UTF-8
lc_ctype | | en_US.UTF-8
lc_messages | | C
lc_monetary | | C
lc_numeric | | C
lc_time | | C
listen_addresses | | *
log_directory | | /Users/josh/runtime/logs
log_line_prefix | | %t %u %h %d
log_min_duration_statement | ms | 0
log_statement | | all
log_timezone | | US/Eastern
logging_collector | | on
maintenance_work_mem | kB | 65536
max_connections | | 7
max_stack_depth | kB | 2048
server_encoding | | UTF8
shared_buffers | 8kB | 16384
TimeZone | | US/Eastern
timezone_abbreviations | | Default
transaction_isolation | | read committed
transaction_read_only | | off
wal_buffers | 8kB | 2048
work_mem | kB | 16384
(32 rows)
On Mon, Oct 4, 2010 at 4:47 PM, Leonardo Francalanci <m_lists@yahoo.it> wrote:
It sounds like the costing model might need a bit more work before we
commit this.
I tried again the simple sql tests I posted a while ago, and I still get
the same ratios.
I've tested the applied patch on a dual Opteron + disk array Solaris machine.
I really don't get how a laptop hard drive can be faster at reading data
using random seeks (required by the original cluster method) than seq
scan + sort for the 5M rows test case.
Same thing for the "cluster vs bloat" test: the seq scan + sort is
faster on my machine.
Well, my last tests showed that the planner was choosing an index scan
for queries like:
SELECT * FROM atable ORDER BY akey;
but forcing a seqscan + sort made this faster, as you expect. So I was
thinking my cost settings (posted upthread) probably need some
tweaking, unless it's a problem with the optimizer. But all of this is
unrelated to the patch.
[... pokes a bit more ...] Sigh, now I'm finding it impossible to
reproduce my own results, particularly the earlier cluster_vs_bloat.sql
test of:
* 10M rows: 84 seconds for seq. scan, 44 seconds for index scan
I'm getting about 5 seconds now for the cluster, both with and without
the patch. effective_cache_size doesn't seem to impact this much. I'll
have another look when I have some more time.
Josh
Josh Kupershmidt <schmiddy@gmail.com> writes:
So I think there are definitely cases where this patch helps, but it
looks like a seq. scan is being chosen in some cases where it doesn't
help.
I've been poking through this patch, and have found two different ways
in which it underestimates the cost of the seqscan case:
* it's not setting rel->width, resulting in an underestimate of the
amount of disk space needed for a sort; this would get worse for wider
tables.
* it's not allowing for the cost of recomputing index expression values
during comparisons. That doesn't matter of course if you're not testing
the index-expression case (which other infelicities suggest hasn't
exactly been stressed yet).
I suspect the first of these might have something to do with your
observation. AFAIR the width value isn't used in estimating indexscan
cost, so this omission would bias it in favor of seqscans, as soon as
the data volume exceeded maintenance_work_mem.
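The first effect is easy to see from SQL (a sketch; the table is
hypothetical): padding the rows makes the external sort spill far more
data to disk, which a width-blind cost estimate will not anticipate.
CREATE TABLE widetab AS
  SELECT g AS akey, repeat('x', 500) AS filler
  FROM generate_series(1, 1000000) g;
SET work_mem = '16MB';
EXPLAIN ANALYZE SELECT * FROM widetab ORDER BY akey;
-- the "Sort Method: external merge Disk: ..." figure grows with the
-- row width, while the competing indexscan estimate does not use it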
regards, tom lane
Itagaki Takahiro <itagaki.takahiro@gmail.com> writes:
I re-ordered some description in the doc. Does it look better?
Comments and suggestions welcome.
Applied with some significant editorialization. The biggest problem I
found was that the code for expression indexes didn't really work, and
would leak memory like there's no tomorrow even when it did work.
I fixed that, but I think the performance is still going to be pretty
undesirable. We have to re-evaluate the index expressions for each
tuple each time we do a comparison, which means it's going to be really
really slow unless the index expressions are *very* cheap. But perhaps
the use-case for clustering on expression indexes is small enough that
this isn't worth worrying about.
I considered computing the index expressions just once as the data is
being fed in, and including their values in the tuples-to-be-sorted;
that would cut the number of times the values have to be computed by
a factor of about log N. But it'd also bloat the on-disk sort data,
which could possibly cost more in I/O than we save. So it's not real
clear what to do anyway.
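To put a rough number on that: sorting N = 10M tuples performs on the
order of N * log2(N), i.e. roughly 230 million, comparisons, so
evaluating the expressions once per input tuple instead of at every
comparison would cut the evaluations by about a factor of log2(N) ~ 23,
at the price of carrying the computed values in every on-disk sort
tuple.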
regards, tom lane
Takahiro Itagaki <itagaki.takahiro@oss.ntt.co.jp> writes:
BTW, we could have LogicalTapeReadExact() as a wrapper around
LogicalTapeRead() that checks the result, because we have many
duplicated code paths for "unexpected end of data" errors.
Good idea, done.
regards, tom lane
Itagaki Takahiro <itagaki.takahiro@gmail.com> writes:
I wrote a patch to improve CLUSTER VERBOSE (and VACUUM FULL VERBOSE).
The patch should be applied after sorted_cluster-20100721.patch.
Applied with minor fixes; in particular, I think you got the effects of
rewrite_heap_dead_tuple backwards. When a tuple is removed from
unresolved_tups, that amounts to changing its status from "recently
dead" to "dead and will not be written out".
regards, tom lane
On Fri, Oct 8, 2010 at 10:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Itagaki Takahiro <itagaki.takahiro@gmail.com> writes:
I wrote a patch to improve CLUSTER VERBOSE (and VACUUM FULL VERBOSE).
The patch should be applied after sorted_cluster-20100721.patch.
Applied with minor fixes; in particular, I think you got the effects of
rewrite_heap_dead_tuple backwards. When a tuple is removed from
unresolved_tups, that amounts to changing its status from "recently
dead" to "dead and will not be written out".
Ah, yes. I misunderstood the behavior. Thanks for the fix!
--
Itagaki Takahiro
Applied with some significant editorialization. The biggest problem I
found was that the code for expression indexes didn't really work, and
would leak memory like there's no tomorrow even when it did work.
Sorry I couldn't write it the way it was supposed to be... I'll look at
the differences to see what I did wrong... (so maybe next time I'll do
better!)
Thank you
Leonardo