Progress on fast path sorting, btree index creation time
I'll try to keep this terse. I've promised to justify the number of
specialisations that are in my fast-path sorting patch, and I may yet
conclude that a reduction is appropriate. Not today though - there are
quite a few ideas in the patch (even more now), and not all have been
exhaustively evaluated. Maybe that isn't what some had hoped for right
now, but deferring a public and definitive answer to those questions
in order to focus on tweaking the patch has brought additional
benefits. I felt that I went too long without checking in with the
community, though.
The purpose of this e-mail is to:
1. Report back a further improvement in the performance benefits seen
when sorting with multiple sortkeys. The gains for that case were
previously relatively modest; they're better now.
2. Report improvements from applying these techniques to btree index
tuplesorting (short version: they're also quite good). The data used
is exactly the same as it was in my previous benchmark; orderlines is
1538MB, and has lots of duplicates. The environment is also identical.
3. Resolve two anticipated controversies that are, respectively,
somewhat orthogonal and completely orthogonal to the binary bloat
controversy. The first (controversy A) is that I have added a new
piece of infrastructure, pg_always_inline, which, as the name
suggests, is a portable way of insisting that a function should be
invariably inlined. Portable libraries like libc have been doing this
for a long time now, and I actually use the libc macro (which expands
to __attribute__((always_inline))) where possible. The second
possible point of contention (controversy B) is that I have jettisoned
various protections against bad qsort implementations that I believe
are a legacy of our pre-2006 use of the system qsort, and that can no
longer be justified. For example, quicksort performing badly in the
face of lots of duplicates is a well-understood problem today
(partitioning should stop on equal keys - see the short sketch just
below), and ISTM that protecting against that outside the qsort
implementation (but only for index sorts) is wrong-headed.
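For illustration only, here is a minimal three-way ("fat pivot")
partitioning quicksort, specialised to int for brevity. This is not the
technique used in src/port/qsort_arg.c (which swaps equal keys to the
ends of the partition instead); it just shows why stopping on equal
keys means that large runs of duplicates cost O(n) per level rather
than degrading towards O(n^2):

/*
 * Classic three-way partitioning: after one pass, a[lo..lt-1] < pivot,
 * a[lt..gt] == pivot, and a[gt+1..hi] > pivot.  Only the strictly-less
 * and strictly-greater regions are recursed into, so duplicates of the
 * pivot are never looked at again.
 */
static void
qsort3_int(int *a, int lo, int hi)
{
	int		pivot,
			lt,
			gt,
			i;

	if (lo >= hi)
		return;
	pivot = a[lo + (hi - lo) / 2];
	lt = lo;
	gt = hi;
	i = lo;
	while (i <= gt)
	{
		if (a[i] < pivot)
		{
			int		swap = a[lt];

			a[lt++] = a[i];
			a[i++] = swap;
		}
		else if (a[i] > pivot)
		{
			int		swap = a[gt];

			a[gt--] = a[i];
			a[i] = swap;
		}
		else
			i++;
	}
	qsort3_int(a, lo, lt - 1);
	qsort3_int(a, gt + 1, hi);
}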
The first possibly controversial adjustment (controversy A) is clearly
also the reason for (1) - that has been well isolated. I haven't got
around to quantifying the performance improvement seen due to the
second possibly controversial adjustment (controversy B), but I
believe that it can be justified as a case of effectively removing
redundant code. After all, we now assert against comparing a
tuple to itself anyway, and the "cheap insurance" that existed in
comparetup_index_btree was never present in any form in
comparetup_heap, and we heard no complaints, AFAIK.
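For reference, here is a minimal sketch of how pg_always_inline could
degrade gracefully on compilers that offer no way to insist on
inlining; the attached patch's actual definition in c.h differs
slightly, and only the GCC attribute and MSVC's __forceinline are
real - the rest is illustrative. The worst case is that the compiler
simply treats it as an ordinary inline hint:

/*
 * Sketch of a portable "always inline" macro.  Where no stronger
 * facility is known, fall back to a plain inline hint.
 */
#if defined(__GNUC__)
#define pg_always_inline inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#define pg_always_inline __forceinline
#else
#define pg_always_inline inline
#endif

/* Example use: a trivial comparator we want folded into its caller */
static pg_always_inline int
int4_cmp_example(int a, int b)
{
	return (a > b) - (a < b);
}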
Attached are:
* A detailing of libc's use of __always_inline /
__attribute__((always_inline)). Note that it often appears in code
that is built for all platforms.
* The WIP patch itself, rebased to integrate with the new SortSupport
infrastructure. I've gone a bit crazy with btree specialisations, but
I suspect that's where it matters least and makes most sense.
* A new Python script for benchmarking index creation, similar to the
one I previously posted for benchmarking the sorting of heap tuples.
* A spreadsheet that shows the results of re-running my earlier heap
tuple sorting benchmark with this new patch. The improvement in the
query that orders by 2 columns is all that is pertinent there, when
considering the value of (1) and the sense in standing still for
controversy A.
* A spreadsheet that shows the difference in index creation times,
generated with the help of the new python script.
Thoughts?
I had another idea when writing this patch that I haven't developed at
all but I'll share anyway. That is, it might be a good idea to use a
balanced quicksort:
http://xlinux.nist.gov/dads//HTML/balancedqsrt.html
It might be possible to get a reasonable approximation of the actual
median value of a given column with existing statistics, which could
be hinted to qsort_arg. This would do a better job of guessing an
appropriate initial pivot value for qsort than the med3 sampling
technique (what we do now, advocated by Sedgewick: use the median of
the first, middle and last elements of the current partition), though
we'd still use that med3 technique to select all other pivots, and
perhaps signal to qsort_arg "you're on your own, fall back on med3 for
the initial pivot" with a NULL pointer, according to some heuristic.
There's obviously a not inconsiderable impedance mismatch to resolve
if we're to do that though, since Tuplesortstate would need a pointer
to the median SortTuple.
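To make the shape of the idea concrete, here is a purely hypothetical
sketch - nothing like this is in the attached patch, and
qsort_arg_hinted/partition_below are made-up names. It assumes the
qsort_arg() and qsort_arg_comparator declarations from port.h, and that
the hint, when supplied, points at something the comparator understands
(for tuplesort that would have to be a SortTuple, which is exactly the
impedance mismatch mentioned above):

#include "postgres.h"
#include <string.h>

/*
 * Hypothetical: partition once around a caller-supplied pivot value (e.g.
 * an approximate column median derived from pg_statistic), then let the
 * ordinary med3-based qsort_arg() handle both halves, so every subsequent
 * pivot is still chosen exactly as it is today.
 */
static size_t
partition_below(char *base, size_t n, size_t es,
				qsort_arg_comparator cmp, void *arg, const void *pivot)
{
	size_t		i,
				store = 0;
	char		tmp[64];		/* sketch assumption: es <= 64 */

	for (i = 0; i < n; i++)
	{
		if (cmp(base + i * es, pivot, arg) < 0)
		{
			/* move element i into the "below the hint" prefix */
			memcpy(tmp, base + store * es, es);
			memcpy(base + store * es, base + i * es, es);
			memcpy(base + i * es, tmp, es);
			store++;
		}
	}
	return store;				/* elements strictly below the hint */
}

static void
qsort_arg_hinted(void *a, size_t n, size_t es,
				 qsort_arg_comparator cmp, void *arg,
				 const void *median_hint)
{
	if (median_hint != NULL && n > 40)
	{
		size_t		nleft = partition_below(a, n, es, cmp, arg, median_hint);

		qsort_arg(a, nleft, es, cmp, arg);
		qsort_arg((char *) a + nleft * es, n - nleft, es, cmp, arg);
		return;
	}

	/* no usable hint: "you're on your own", med3 picks the first pivot too */
	qsort_arg(a, n, es, cmp, arg);
}

A NULL median_hint gives exactly today's behaviour, so whatever
heuristic decides whether the statistics are trustworthy can be as
conservative as it likes.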
Can I get a straw poll on how much of a problem worst-case performance
of qsort is believed to be?
In a perfect world, if it were somehow possible to know the perfect
pivots ahead of time from a histogram or something, we'd have a
quicksort variant with worst-case performance of O(n log(n)). That, or the
limited approximation that I've sketched would perhaps be worthwhile,
even if it were subject to a number of caveats. Wikipedia claims that
the worst case for quicksort is O(n log(n)) with the refinements
recommended by Sedgewick's 1978 paper, but this seems like nonsense to
me - the med3 technique is a good heuristic in practice, but it's
perfectly possible in theory for its sampling to always get things
completely wrong (this is not an unfamiliar problem). How often it
messes up in the real world and how much it matters is something that
I wouldn't like to speculate on, though it is the main factor that
will determine if the idea is worth pursuing. I can tell you that I
have heard one person observe that it had an unpredictable runtime.
However, they might not have noticed if this patch was applied,
because in general the worse quicksort does the better this patch
does. Also, the fact that we use a median of "medians" when n > 40
(Dr. Sedgewick again, if I'm not mistaken) makes me a little skeptical
of that claim. Come to think of it, it might not be a bad idea to add
a bunch of comments to qsort_arg while I have this swapped into my
head, as it currently has no comments at all.
While I'm thinking out loud, here's another idea: have a GUC which,
when enabled (perhaps at various different granularities), makes
Timsort the in-memory sorting algorithm, as it now is by default for
Python, Java SE 7 (for arrays) and the Android platform; for
certain applications, this could be a real win, and I don't think that
it would have to be invasive: it could be dropped in.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Attachments:
fastpath_sort_btree_rev.patch (text/x-patch)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
new file mode 100644
index 245b224..8491d58
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
*************** heapgettup_pagemode(HeapScanDesc scan,
*** 833,839 ****
#if defined(DISABLE_COMPLEX_MACRO)
/*
* This is formatted so oddly so that the correspondence to the macro
! * definition in access/heapam.h is maintained.
*/
Datum
fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
--- 833,839 ----
#if defined(DISABLE_COMPLEX_MACRO)
/*
* This is formatted so oddly so that the correspondence to the macro
! * definition in access/htup.h is maintained.
*/
Datum
fastgetattr(HeapTuple tup, int attnum, TupleDesc tupleDesc,
diff --git a/src/backend/access/nbtree/nbtcompare.c b/src/backend/access/nbtree/nbtcompare.c
new file mode 100644
index f86b755..4c7cb00
*** a/src/backend/access/nbtree/nbtcompare.c
--- b/src/backend/access/nbtree/nbtcompare.c
*************** btint4cmp(PG_FUNCTION_ARGS)
*** 102,108 ****
PG_RETURN_INT32(-1);
}
! static int
btint4fastcmp(Datum x, Datum y, SortSupport ssup)
{
int32 a = DatumGetInt32(x);
--- 102,110 ----
PG_RETURN_INT32(-1);
}
! #ifndef USE_INLINE
!
! int
btint4fastcmp(Datum x, Datum y, SortSupport ssup)
{
int32 a = DatumGetInt32(x);
*************** btint4fastcmp(Datum x, Datum y, SortSupp
*** 116,121 ****
--- 118,125 ----
return -1;
}
+ #endif /* ! USE_INLINE */
+
Datum
btint4sortsupport(PG_FUNCTION_ARGS)
{
*************** btint8cmp(PG_FUNCTION_ARGS)
*** 139,145 ****
PG_RETURN_INT32(-1);
}
! static int
btint8fastcmp(Datum x, Datum y, SortSupport ssup)
{
int64 a = DatumGetInt64(x);
--- 143,151 ----
PG_RETURN_INT32(-1);
}
! #ifndef USE_INLINE
!
! int
btint8fastcmp(Datum x, Datum y, SortSupport ssup)
{
int64 a = DatumGetInt64(x);
*************** btint8fastcmp(Datum x, Datum y, SortSupp
*** 153,158 ****
--- 159,166 ----
return -1;
}
+ #endif /* ! USE_INLINE */
+
Datum
btint8sortsupport(PG_FUNCTION_ARGS)
{
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
new file mode 100644
index 63b09a4..b1086eb
*** a/src/backend/utils/adt/float.c
--- b/src/backend/utils/adt/float.c
*************** do { \
*** 67,76 ****
/* Configurable GUC parameter */
int extra_float_digits = 0; /* Added to DBL_DIG or FLT_DIG */
-
- static int float4_cmp_internal(float4 a, float4 b);
- static int float8_cmp_internal(float8 a, float8 b);
-
#ifndef HAVE_CBRT
/*
* Some machines (in particular, some versions of AIX) have an extern
--- 67,72 ----
*************** float8div(PG_FUNCTION_ARGS)
*** 844,850 ****
/*
* float4{eq,ne,lt,le,gt,ge} - float4/float4 comparison operations
*/
! static int
float4_cmp_internal(float4 a, float4 b)
{
/*
--- 840,848 ----
/*
* float4{eq,ne,lt,le,gt,ge} - float4/float4 comparison operations
*/
! #ifndef USE_INLINE
!
! int
float4_cmp_internal(float4 a, float4 b)
{
/*
*************** float4_cmp_internal(float4 a, float4 b)
*** 874,879 ****
--- 872,879 ----
}
}
+ #endif /* ! USE_INLINE */
+
Datum
float4eq(PG_FUNCTION_ARGS)
{
*************** btfloat4cmp(PG_FUNCTION_ARGS)
*** 937,943 ****
PG_RETURN_INT32(float4_cmp_internal(arg1, arg2));
}
! static int
btfloat4fastcmp(Datum x, Datum y, SortSupport ssup)
{
float4 arg1 = DatumGetFloat4(x);
--- 937,946 ----
PG_RETURN_INT32(float4_cmp_internal(arg1, arg2));
}
!
! #ifndef USE_INLINE
!
! int
btfloat4fastcmp(Datum x, Datum y, SortSupport ssup)
{
float4 arg1 = DatumGetFloat4(x);
*************** btfloat4fastcmp(Datum x, Datum y, SortSu
*** 946,951 ****
--- 949,956 ----
return float4_cmp_internal(arg1, arg2);
}
+ #endif /* ! USE_INLINE */
+
Datum
btfloat4sortsupport(PG_FUNCTION_ARGS)
{
*************** btfloat4sortsupport(PG_FUNCTION_ARGS)
*** 958,964 ****
/*
* float8{eq,ne,lt,le,gt,ge} - float8/float8 comparison operations
*/
! static int
float8_cmp_internal(float8 a, float8 b)
{
/*
--- 963,972 ----
/*
* float8{eq,ne,lt,le,gt,ge} - float8/float8 comparison operations
*/
!
! #ifndef USE_INLINE
!
! int
float8_cmp_internal(float8 a, float8 b)
{
/*
*************** float8_cmp_internal(float8 a, float8 b)
*** 988,993 ****
--- 996,1003 ----
}
}
+ #endif /* ! USE_INLINE */
+
Datum
float8eq(PG_FUNCTION_ARGS)
{
*************** btfloat8cmp(PG_FUNCTION_ARGS)
*** 1051,1057 ****
PG_RETURN_INT32(float8_cmp_internal(arg1, arg2));
}
! static int
btfloat8fastcmp(Datum x, Datum y, SortSupport ssup)
{
float8 arg1 = DatumGetFloat8(x);
--- 1061,1069 ----
PG_RETURN_INT32(float8_cmp_internal(arg1, arg2));
}
! #ifndef USE_INLINE
!
! int
btfloat8fastcmp(Datum x, Datum y, SortSupport ssup)
{
float8 arg1 = DatumGetFloat8(x);
*************** btfloat8fastcmp(Datum x, Datum y, SortSu
*** 1060,1065 ****
--- 1072,1079 ----
return float8_cmp_internal(arg1, arg2);
}
+ #endif /* ! USE_INLINE */
+
Datum
btfloat8sortsupport(PG_FUNCTION_ARGS)
{
diff --git a/src/backend/utils/sort/sortsupport.c b/src/backend/utils/sort/sortsupport.c
new file mode 100644
index 4a87f11..d5bee30
*** a/src/backend/utils/sort/sortsupport.c
--- b/src/backend/utils/sort/sortsupport.c
***************
*** 16,21 ****
--- 16,22 ----
#include "postgres.h"
#include "fmgr.h"
+ #include "utils/fmgroids.h"
#include "utils/lsyscache.h"
#include "utils/sortsupport.h"
*************** PrepareSortSupportComparisonShim(Oid cmp
*** 127,132 ****
--- 128,177 ----
}
/*
+ * Given a pgproc sort support function, see if it's safe to sort using a
+ * hard-coded generic comparator for scalar types, to facilitate an optimization
+ * that may later be used within tuplesort.c.
+ *
+ * This is actually technically quite orthogonal to the user-exposed API for
+ * providing a directly callable comparator, and eliding the SQL function call
+ * machinery through that API may separately improve performance for the same
+ * sort under certain circumstances. However, it is reasonable to map built-in
+ * sort support functions to a corresponding hard coded comparator, because it
+ * is always expected that the types that this optimization will be used with
+ * will be a subset of types with a built-in sort support function - if it
+ * wasn't worth doing the former, it certainly won't be worth doing the latter.
+ * It is also reasonable because the fast comparator variants are, as a matter
+ * of policy, directly used for this optimization, so ipso facto a
+ * corresponding sort support function must be available.
+ */
+ TypeCompar
+ ResolveComparatorProper(Oid Proc)
+ {
+ switch (Proc)
+ {
+ /* Some comparison that has an underlying int4 representation */
+ case F_BTINT4SORTSUPPORT:
+ return TYPE_COMP_INT4;
+ case F_DATE_SORTSUPPORT:
+ return TYPE_COMP_INT4;
+ /* Some comparison that has an underlying int8 representation */
+ case F_BTINT8SORTSUPPORT:
+ return TYPE_COMP_INT8;
+ #ifdef HAVE_INT64_TIMESTAMP
+ case F_TIMESTAMP_SORTSUPPORT:
+ return TYPE_COMP_INT8;
+ #endif
+ /* floating point types */
+ case F_BTFLOAT4SORTSUPPORT:
+ return TYPE_COMP_FLOAT4;
+ case F_BTFLOAT8SORTSUPPORT:
+ return TYPE_COMP_FLOAT8;
+ default:
+ return TYPE_COMP_OTHER;
+ }
+ }
+
+ /*
* Fill in SortSupport given an ordering operator (btree "<" or ">" operator).
*
* Caller must previously have zeroed the SortSupportData structure and then
*************** PrepareSortSupportFromOrderingOp(Oid ord
*** 157,160 ****
--- 202,210 ----
/* We'll use a shim to call the old-style btree comparator */
PrepareSortSupportComparisonShim(sortFunction, ssup);
}
+ /*
+ * We may later avail of a further optimization for a few built-in scalar
+ * types: inlining of the comparator proper.
+ */
+ ssup->usable_compar = ResolveComparatorProper(sortFunction);
}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 28d585f..2676c9f
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
***************
*** 106,111 ****
--- 106,112 ----
#include "executor/executor.h"
#include "miscadmin.h"
#include "pg_trace.h"
+ #include "utils/builtins.h"
#include "utils/datum.h"
#include "utils/logtape.h"
#include "utils/lsyscache.h"
***************
*** 113,118 ****
--- 114,120 ----
#include "utils/pg_rusage.h"
#include "utils/rel.h"
#include "utils/sortsupport.h"
+ #include "utils/template_qsort_arg.h"
#include "utils/tuplesort.h"
*************** static void tuplesort_heap_insert(Tuples
*** 456,461 ****
--- 458,464 ----
static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex);
static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
static void markrunend(Tuplesortstate *state, int tapenum);
+ static inline int comparetup_inline(void *a, void *b, Tuplesortstate *state);
static int comparetup_heap(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
*************** static void readtup_datum(Tuplesortstate
*** 492,497 ****
--- 495,537 ----
static void reversedirection_datum(Tuplesortstate *state);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ /*
+ * Manufacture type specific sorting specialisations with inline comparators.
+ *
+ * Use fast comparator functions for "comparator proper" in each case. They're
+ * used via function pointers to elide SQL-function-call overhead when sorting,
+ * but that doesn't provide the additional benefits of inlining and other
+ * optimizations from compile-time comparator knowledge.
+ */
+ #define int4_shim(x,y) btint4fastcmp(x, y, NULL)
+ #define int8_shim(x,y) btint8fastcmp(x, y, NULL)
+ #define float4_shim(x,y) btfloat4fastcmp(x, y, NULL)
+ #define float8_shim(x,y) btfloat8fastcmp(x, y, NULL)
+
+ #define noop_cmp_proper(nop1, nop2) 0
+ #define noop_code (void)0;
+
+ /* Instantiate sorting specializations (heap) */
+ TEMPLATE_QSORT_ARG_HEAP(int4, int4_shim);
+ TEMPLATE_QSORT_ARG_HEAP(int8, int8_shim);
+ TEMPLATE_QSORT_ARG_HEAP(float4, float4_shim);
+ TEMPLATE_QSORT_ARG_HEAP(float8, float8_shim);
+
+ /* Instantiate sorting specializations (btree index) */
+ TEMPLATE_QSORT_ARG_BTREE(int4, int4_shim);
+ TEMPLATE_QSORT_ARG_BTREE(int8, int8_shim);
+ TEMPLATE_QSORT_ARG_BTREE(float4, float4_shim);
+ TEMPLATE_QSORT_ARG_BTREE(float8, float8_shim);
+
+ /*
+ * A semi-generic specialization, for single sort-key sorts not covered by
+ * other specializations. Note that this specialization is used for sorting
+ * both heap and index tuples that meet that criteria.
+ *
+ * This results in dead code, but since the relevant functions are static, they
+ * will be discarded by compiler optimization.
+ */
+ DO_TEMPLATE_QSORT_ARG(all, noop_cmp_proper, sing, noop_code, comparetup_inline);
/*
* tuplesort_begin_xxx
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 631,636 ****
--- 671,704 ----
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
+ if (state->sortKeys)
+ {
+ switch(state->sortKeys->usable_compar)
+ {
+ case TYPE_COMP_INT4:
+ state->sortKeys->qsort_arg_spec =
+ nkeys==1?int4_inlheap_qsort_arg:int4_regheap_qsort_arg;
+ break;
+ case TYPE_COMP_INT8:
+ state->sortKeys->qsort_arg_spec =
+ nkeys==1?int8_inlheap_qsort_arg:int8_regheap_qsort_arg;
+ break;
+ case TYPE_COMP_FLOAT4:
+ state->sortKeys->qsort_arg_spec =
+ nkeys==1?float4_inlheap_qsort_arg:float4_regheap_qsort_arg;
+ break;
+ case TYPE_COMP_FLOAT8:
+ state->sortKeys->qsort_arg_spec =
+ nkeys==1?float8_inlheap_qsort_arg:float8_regheap_qsort_arg;
+ break;
+ case TYPE_COMP_OTHER:
+ state->sortKeys->qsort_arg_spec =
+ nkeys==1?all_sing_qsort_arg:NULL;
+ break;
+ default:
+ elog(ERROR, "unrecognized comparator type for heap tuplesort");
+ }
+ }
MemoryContextSwitchTo(oldcontext);
return state;
*************** tuplesort_begin_index_btree(Relation ind
*** 733,738 ****
--- 801,876 ----
state->indexScanKey = _bt_mkscankey_nodata(indexRel);
state->enforceUnique = enforceUnique;
+ if (state->sortKeys)
+ {
+ if ( state->nKeys == 1 && !state->enforceUnique)
+ {
+ switch(state->sortKeys->usable_compar)
+ {
+ /*
+ * Note that "heap" variants are used for single-key sorts, because
+ * they'll work equally well for btree sorts under this exact
+ * set of circumstances.
+ */
+ case TYPE_COMP_INT4:
+ state->sortKeys->qsort_arg_spec = int4_inlheap_qsort_arg;
+ break;
+ case TYPE_COMP_INT8:
+ state->sortKeys->qsort_arg_spec = int8_inlheap_qsort_arg;
+ break;
+ case TYPE_COMP_FLOAT4:
+ state->sortKeys->qsort_arg_spec = float4_inlheap_qsort_arg;
+ break;
+ case TYPE_COMP_FLOAT8:
+ state->sortKeys->qsort_arg_spec = float8_inlheap_qsort_arg;
+ break;
+ default:
+ state->sortKeys->qsort_arg_spec = all_sing_qsort_arg;
+ break;
+ }
+ }
+ else if ( state->nKeys == 1 && state->enforceUnique)
+ {
+ switch(state->sortKeys->usable_compar)
+ {
+ case TYPE_COMP_INT4:
+ state->sortKeys->qsort_arg_spec = int4_regbtreeuniq_qsort_arg;
+ break;
+ case TYPE_COMP_INT8:
+ state->sortKeys->qsort_arg_spec = int8_regbtreeuniq_qsort_arg;
+ break;
+ case TYPE_COMP_FLOAT4:
+ state->sortKeys->qsort_arg_spec = float4_regbtreeuniq_qsort_arg;
+ break;
+ case TYPE_COMP_FLOAT8:
+ state->sortKeys->qsort_arg_spec = float8_regbtreeuniq_qsort_arg;
+ break;
+ default:
+ state->sortKeys->qsort_arg_spec = NULL;
+ }
+ }
+ else
+ {
+ switch(state->sortKeys->usable_compar)
+ {
+ case TYPE_COMP_INT4:
+ state->sortKeys->qsort_arg_spec = int4_regbtree_qsort_arg;
+ break;
+ case TYPE_COMP_INT8:
+ state->sortKeys->qsort_arg_spec = int8_regbtree_qsort_arg;
+ break;
+ case TYPE_COMP_FLOAT4:
+ state->sortKeys->qsort_arg_spec = float4_regbtree_qsort_arg;
+ break;
+ case TYPE_COMP_FLOAT8:
+ state->sortKeys->qsort_arg_spec = float8_regbtree_qsort_arg;
+ break;
+ default:
+ state->sortKeys->qsort_arg_spec = NULL;
+ }
+ }
+ }
+
MemoryContextSwitchTo(oldcontext);
return state;
*************** tuplesort_performsort(Tuplesortstate *st
*** 1221,1232 ****
* We were able to accumulate all the tuples within the allowed
* amount of memory. Just qsort 'em and we're done.
*/
if (state->memtupcount > 1)
! qsort_arg((void *) state->memtuples,
! state->memtupcount,
! sizeof(SortTuple),
! (qsort_arg_comparator) state->comparetup,
! (void *) state);
state->current = 0;
state->eof_reached = false;
state->markpos_offset = 0;
--- 1359,1391 ----
* We were able to accumulate all the tuples within the allowed
* amount of memory. Just qsort 'em and we're done.
*/
+
if (state->memtupcount > 1)
! {
! /* Use a sorting specialization if available */
! if (state->sortKeys &&
! state->sortKeys->qsort_arg_spec)
! {
! state->sortKeys->qsort_arg_spec((void *) state->memtuples,
! state->memtupcount,
! sizeof(SortTuple),
! (void *) state);
! }
! else
! {
! /*
! * Fall back on regular qsort_arg, with function pointer
! * comparator, making no static assumptions about the number
! * of sortkeys, datatype, or type of tuples that are
! * sorted.
! */
! qsort_arg((void *) state->memtuples,
! state->memtupcount,
! sizeof(SortTuple),
! (qsort_arg_comparator) state->comparetup,
! (void *) state);
! }
! }
state->current = 0;
state->eof_reached = false;
state->markpos_offset = 0;
*************** inlineApplySortFunction(FmgrInfo *sortFu
*** 2645,2653 ****
--- 2804,2840 ----
return compare;
}
+ /*
+ * This is a cut-down duplicate of comparetup_heap/comparetup_index_btree that
+ * exists for the express purpose of generating a sorting specialization in
+ * which it is inlined, as a performance optimization. It can be used equally
+ * well for both classes of tuple, as it does not make any assumptions about
+ * the representation of the tuple proper.
+ *
+ * This is only possible for sorts with a single sortKey. Note that the
+ * comparator proper is not inlined, so this is only a partial specialization.
+ */
+ static inline int
+ comparetup_inline(void *a, void *b, Tuplesortstate *state)
+ {
+ /* void* parameters are used to shut-up the compiler */
+ const SortTuple *aT = (const SortTuple*) a, *bT = (const SortTuple*) b;
+ /* Allow interrupting long sorts */
+ CHECK_FOR_INTERRUPTS();
+
+ Assert(state->nKeys == 1);
+
+ /* Compare the one and only sort key with minimal indirection */
+ return ApplySortComparator(aT->datum1, aT->isnull1,
+ bT->datum1, bT->isnull1,
+ state->sortKeys);
+ }
/*
* Routines specialized for HeapTuple (actually MinimalTuple) case
+ *
+ * This code is partially duplicated within template_qsort_arg.h. Ensure that
+ * they are kept consistent.
*/
static int
*************** readtup_cluster(Tuplesortstate *state, S
*** 2974,2979 ****
--- 3161,3169 ----
* The btree and hash cases require separate comparison functions, but the
* IndexTuple representation is the same so the copy/write/read support
* functions can be shared.
+ *
+ * This code is partially duplicated within template_qsort_arg.h. Ensure that
+ * they are kept consistent.
*/
static int
*************** comparetup_index_btree(const SortTuple *
*** 2981,2990 ****
Tuplesortstate *state)
{
/*
! * This is similar to _bt_tuplecompare(), but we have already done the
! * index_getattr calls for the first column, and we need to keep track of
! * whether any null fields are present. Also see the special treatment
! * for equal keys at the end.
*/
ScanKey scanKey = state->indexScanKey;
IndexTuple tuple1;
--- 3171,3179 ----
Tuplesortstate *state)
{
/*
! * We have already done the index_getattr calls for the first column, and we
! * need to keep track of whether any null fields are present. Also see the
! * special treatment for equal keys at the end.
*/
ScanKey scanKey = state->indexScanKey;
IndexTuple tuple1;
*************** comparetup_index_btree(const SortTuple *
*** 3046,3063 ****
* they *must* get compared at some stage of the sort --- otherwise the
* sort algorithm wouldn't have checked whether one must appear before the
* other.
- *
- * Some rather brain-dead implementations of qsort will sometimes call the
- * comparison routine to compare a value to itself. (At this writing only
- * QNX 4 is known to do such silly things; we don't support QNX anymore,
- * but perhaps the behavior still exists elsewhere.) Don't raise a bogus
- * error in that case.
*/
! if (state->enforceUnique && !equal_hasnull && tuple1 != tuple2)
{
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
index_deform_tuple(tuple1, tupDes, values, isnull);
ereport(ERROR,
(errcode(ERRCODE_UNIQUE_VIOLATION),
--- 3235,3255 ----
* they *must* get compared at some stage of the sort --- otherwise the
* sort algorithm wouldn't have checked whether one must appear before the
* other.
*/
! if (state->enforceUnique && !equal_hasnull)
{
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
+ /*
+ * In the past, it was conceivable that we'd have to protect against
+ * a comparison of a tuple to itself, because we used the system
+ * qsort(), and as such had to assume the worst about the implementation's
+ * bogosity. This is no longer the case, though this assertion serves to
+ * prevent that problem from re-emerging.
+ */
+ Assert(tuple1 != tuple2);
+
index_deform_tuple(tuple1, tupDes, values, isnull);
ereport(ERROR,
(errcode(ERRCODE_UNIQUE_VIOLATION),
*************** comparetup_index_btree(const SortTuple *
*** 3067,3094 ****
BuildIndexValueDescription(state->indexRel,
values, isnull))));
}
-
- /*
- * If key values are equal, we sort on ItemPointer. This does not affect
- * validity of the finished index, but it offers cheap insurance against
- * performance problems with bad qsort implementations that have trouble
- * with large numbers of equal keys.
- */
- {
- BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
- BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
-
- if (blk1 != blk2)
- return (blk1 < blk2) ? -1 : 1;
- }
- {
- OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
- OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
-
- if (pos1 != pos2)
- return (pos1 < pos2) ? -1 : 1;
- }
-
return 0;
}
--- 3259,3264 ----
*************** comparetup_index_hash(const SortTuple *a
*** 3098,3105 ****
{
uint32 hash1;
uint32 hash2;
- IndexTuple tuple1;
- IndexTuple tuple2;
/* Allow interrupting long sorts */
CHECK_FOR_INTERRUPTS();
--- 3268,3273 ----
*************** comparetup_index_hash(const SortTuple *a
*** 3118,3147 ****
else if (hash1 < hash2)
return -1;
- /*
- * If hash values are equal, we sort on ItemPointer. This does not affect
- * validity of the finished index, but it offers cheap insurance against
- * performance problems with bad qsort implementations that have trouble
- * with large numbers of equal keys.
- */
- tuple1 = (IndexTuple) a->tuple;
- tuple2 = (IndexTuple) b->tuple;
-
- {
- BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
- BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
-
- if (blk1 != blk2)
- return (blk1 < blk2) ? -1 : 1;
- }
- {
- OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
- OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
-
- if (pos1 != pos2)
- return (pos1 < pos2) ? -1 : 1;
- }
-
return 0;
}
--- 3286,3291 ----
diff --git a/src/include/c.h b/src/include/c.h
new file mode 100644
index 5bed719..0e61ed2
*** a/src/include/c.h
--- b/src/include/c.h
*************** extern int fdatasync(int fildes);
*** 850,853 ****
--- 850,865 ----
/* /port compatibility functions */
#include "port.h"
+ /*
+ * Define a cross-platform "always-inline" macro. This is a very sharp tool that
+ * should be used judiciously.
+ */
+ #ifdef __always_inline
+ #define pg_always_inline __always_inline
+ #elif defined(__force_inline)
+ #define pg_always_inline __force_inline
+ #else
+ #define pg_always_inline inline
+ #endif
+
#endif /* C_H */
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
new file mode 100644
index aa36db6..d0ddbeb
*** a/src/include/utils/builtins.h
--- b/src/include/utils/builtins.h
*************** extern Datum btfloat8sortsupport(PG_FUNC
*** 323,328 ****
--- 323,463 ----
extern Datum btoidsortsupport(PG_FUNCTION_ARGS);
extern Datum btnamesortsupport(PG_FUNCTION_ARGS);
+ /*
+ * Expose some "fast" variants of per-opclass comparison functions that
+ * will incidentally be returned by some of the above sort-support
+ * functions. Do so to avail of optimizations from compile-time knowledge
+ * of comparators.
+ *
+ * Exposed variants don't touch SortSupportData struct, so that parameter
+ * can remain opaque.
+ */
+
+ /* Declaration is sufficient */
+ #ifndef SORTSUPPORT_H
+ #define DECLARE_SORT_SUPPORT
+ typedef struct SortSupportData *SortSupport;
+ #endif
+
+ #ifdef USE_INLINE
+
+ static inline int
+ btint4fastcmp(Datum x, Datum y, SortSupport ssup)
+ {
+ int32 a = DatumGetInt32(x);
+ int32 b = DatumGetInt32(y);
+
+ if (a > b)
+ return 1;
+ else if (a == b)
+ return 0;
+ else
+ return -1;
+ }
+
+ static inline int
+ btint8fastcmp(Datum x, Datum y, SortSupport ssup)
+ {
+ int64 a = DatumGetInt64(x);
+ int64 b = DatumGetInt64(y);
+
+ if (a > b)
+ return 1;
+ else if (a == b)
+ return 0;
+ else
+ return -1;
+ }
+
+ static inline int
+ float4_cmp_internal(float4 a, float4 b)
+ {
+ /*
+ * We consider all NANs to be equal and larger than any non-NAN. This is
+ * somewhat arbitrary; the important thing is to have a consistent sort
+ * order.
+ */
+ if (isnan(a))
+ {
+ if (isnan(b))
+ return 0; /* NAN = NAN */
+ else
+ return 1; /* NAN > non-NAN */
+ }
+ else if (isnan(b))
+ {
+ return -1; /* non-NAN < NAN */
+ }
+ else
+ {
+ if (a > b)
+ return 1;
+ else if (a < b)
+ return -1;
+ else
+ return 0;
+ }
+ }
+
+ static inline int
+ float8_cmp_internal(float8 a, float8 b)
+ {
+ /*
+ * We consider all NANs to be equal and larger than any non-NAN. This is
+ * somewhat arbitrary; the important thing is to have a consistent sort
+ * order.
+ */
+ if (isnan(a))
+ {
+ if (isnan(b))
+ return 0; /* NAN = NAN */
+ else
+ return 1; /* NAN > non-NAN */
+ }
+ else if (isnan(b))
+ {
+ return -1; /* non-NAN < NAN */
+ }
+ else
+ {
+ if (a > b)
+ return 1;
+ else if (a < b)
+ return -1;
+ else
+ return 0;
+ }
+ }
+
+ static inline int
+ btfloat4fastcmp(Datum x, Datum y, SortSupport ssup)
+ {
+ float4 arg1 = DatumGetFloat4(x);
+ float4 arg2 = DatumGetFloat4(y);
+
+ return float4_cmp_internal(arg1, arg2);
+ }
+
+ static inline int
+ btfloat8fastcmp(Datum x, Datum y, SortSupport ssup)
+ {
+ float8 arg1 = DatumGetFloat8(x);
+ float8 arg2 = DatumGetFloat8(y);
+
+ return float8_cmp_internal(arg1, arg2);
+ }
+
+ #else
+
+ extern int btint4fastcmp(Datum x, Datum y, SortSupport ssup);
+ extern int btint8fastcmp(Datum x, Datum y, SortSupport ssup);
+ extern int float4_cmp_internal(float4 a, float4 b);
+ extern int float8_cmp_internal(float8 a, float8 b);
+ extern int btfloat4fastcmp(Datum x, Datum y, SortSupport ssup);
+ extern int btfloat8fastcmp(Datum x, Datum y, SortSupport ssup);
+
+ #endif /* USE_INLINE */
+
/* float.c */
extern PGDLLIMPORT int extra_float_digits;
diff --git a/src/include/utils/sortsupport.h b/src/include/utils/sortsupport.h
new file mode 100644
index bd6dde3..88a4315
*** a/src/include/utils/sortsupport.h
--- b/src/include/utils/sortsupport.h
***************
*** 31,36 ****
--- 31,42 ----
* data can be stored using the ssup_extra field. Any such data
* should be allocated in the ssup_cxt memory context.
*
+ * In addition to providing lower-overhead comparators, this infrastructure also
+ * resolves a generic comparator for the type, if one is available from a
+ * static set, each of which corresponds to a generic Datum representation.
+ * These are available only for a subset of built-in sort support functions.
+ * This enables more aggressive per-type optimizations at sort time.
+ *
* Note: since pg_amproc functions are indexed by (lefttype, righttype)
* it is possible to associate a BTSORTSUPPORT function with a cross-type
* comparison. This could sensibly be used to provide a fast comparator
***************
*** 49,55 ****
--- 55,90 ----
#include "access/attnum.h"
+ /*
+ * Which comparator representation can be used for this type? It is acceptable
+ * to use an int4 comparator with the date datatype for example, because both
+ * types have the same underlying representation. In particular, their
+ * comparators are interchangeable.
+ *
+ * It is incorrect to do this with types that share a certain bitwise
+ * representation with some scalar type, represented by an enum constant here,
+ * when they do not have interchangeable comparators. For example, sortkeys of
+ * legacy float8 representation of timestamps will not be set to
+ * TYPE_COMP_FLOAT8, because its comparator has special handling of NaN values.
+ *
+ * To an even greater extent than with the optimization whereby a btree index
+ * opclass provides a "fast" comparator to elide SQL function call overhead,
+ * it's not useful to do this with types that have inherently high-overhead
+ * comparisons, of greater than a few instructions; the cost of the comparison
+ * itself is expected to dominate, marginalizing any benefit.
+ */
+ typedef enum TypeCompar
+ {
+ TYPE_COMP_INT4,
+ TYPE_COMP_INT8,
+ TYPE_COMP_FLOAT4,
+ TYPE_COMP_FLOAT8,
+ TYPE_COMP_OTHER
+ } TypeCompar;
+
+ #ifndef DECLARE_SORT_SUPPORT
typedef struct SortSupportData *SortSupport;
+ #endif
typedef struct SortSupportData
{
*************** typedef struct SortSupportData
*** 96,103 ****
int (*comparator) (Datum x, Datum y, SortSupport ssup);
/*
! * Additional sort-acceleration functions might be added here later.
*/
} SortSupportData;
--- 131,161 ----
int (*comparator) (Datum x, Datum y, SortSupport ssup);
/*
! * This specialization function pointer is sometimes used as an alternative
! * to the standard qsort_arg, when it has been determined that we can
! * benefit from various per-type and per number of sort key performance
! * optimizations.
! *
! * Often, this will simply entail using a variant that inlines the comparator
! * that was previously used (through a pointer) by the highly generic
! * qsort_arg, such as comparetup_heap, which encapsulate the details of
! * performing a comparison on the class of tuple in question, and are where
! * the call is made to MySortSupport->comparator. This isn't always done
! * because it isn't useful to attempt to get the compiler to inline in cases
! * where there is more than a single sort key.
! *
! * In some other cases, particularly with scalar datatypes that are assumed
! * to be sorted far more frequently in practice, the specialization goes so
! * far as to inline the comparator proper from within the tuple-class
! * encapsulating comparator.
*/
+ void (*qsort_arg_spec)(void *a, size_t n, size_t es, void *arg);
+
+ /*
+ * Which type-specific comparator proper (which is part of a fully inlined
+ * specialization) can be safely used, if any?
+ */
+ TypeCompar usable_compar;
} SortSupportData;
*************** extern int ApplySortComparator(Datum dat
*** 151,156 ****
--- 209,215 ----
/* Other functions in utils/sort/sortsupport.c */
extern void PrepareSortSupportComparisonShim(Oid cmpFunc, SortSupport ssup);
+ extern TypeCompar ResolveComparatorProper(Oid orderingOp);
extern void PrepareSortSupportFromOrderingOp(Oid orderingOp, SortSupport ssup);
#endif /* SORTSUPPORT_H */
diff --git a/src/include/utils/template_qsort_arg.h b/src/include/utils/template_qsort_arg.h
new file mode 100644
index ...de1deb8
*** a/src/include/utils/template_qsort_arg.h
--- b/src/include/utils/template_qsort_arg.h
***************
*** 0 ****
--- 1,411 ----
+ /*-------------------------------------------------------------------------
+ * template_qsort_arg.h: "template" version of qsort_arg.c
+ *
+ * This version of qsort_arg is exclusively used within tuplesort.c to more
+ * efficiently sort common types such as integers and floats. In providing
+ * this version, we seek to take advantage of compile-time optimizations,
+ * in particular, the inlining of comparators. In some cases the "tuple class"
+ * encapsulating comparator (that is, the particular comparator used directly
+ * here, which depends on whether we're sorting heap tuples or btree index
+ * tuples) is just inlined, when a full specialization is unavailable. In
+ * other cases, the comparator proper is also inlined, for full integration of
+ * the sort routine. There are variant specializations for cases with only a
+ * single sortkey, and cases with multiple sortkeys.
+ *
+ * The TEMPLATE_QSORT_ARG() macro generates an inlining variant (for sorts
+ * with a single sortKey) and non-inlining variant for sorts with multiple
+ * sortKeys.
+ *
+ * We rely on various function declarations, and indeed partially duplicate
+ * some code from tuplesort.c, so this file should be considered private to
+ * that module, rather than a generic piece of infrastructure.
+ *
+ * CAUTION: if you change this file, see also qsort_arg.c as well as
+ * qsort.c. qsort_arg.c should be considered authoritative.
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/template_qsort_arg.h
+ *-------------------------------------------------------------------------
+ */
+ #include "c.h"
+
+ #define swapcode(TYPE, parmi, parmj, n) \
+ do { \
+ size_t i = (n) / sizeof (TYPE); \
+ TYPE *pi = (TYPE *)(void *)(parmi); \
+ TYPE *pj = (TYPE *)(void *)(parmj); \
+ do { \
+ TYPE t = *pi; \
+ *pi++ = *pj; \
+ *pj++ = t; \
+ } while (--i > 0); \
+ } while (0)
+
+ #define SWAPINIT(a, es) swaptype = ((char *)(a) - (char *)0) % sizeof(long) || \
+ (es) % sizeof(long) ? 2 : (es) == sizeof(long)? 0 : 1;
+
+ #define thistype_vecswap(a, b, n) \
+ if ((n) > 0) inl_swapfunc((a), (b), (size_t)(n), swaptype)
+
+ #define thistype_swap(a, b) \
+ if (swaptype == 0) { \
+ long t = *(long *)(void *)(a); \
+ *(long *)(void *)(a) = *(long *)(void *)(b); \
+ *(long *)(void *)(b) = t; \
+ } else \
+ inl_swapfunc(a, b, es, swaptype)
+
+ inline static void
+ inl_swapfunc(char *a, char *b, size_t n, int swaptype)
+ {
+ if (swaptype <= 1)
+ swapcode(long, a, b, n);
+ else
+ swapcode(char, a, b, n);
+ }
+
+ /*
+ * This macro manufactures a type-specific implementation of qsort_arg with
+ * the comparator, COMPAR, known at compile time. COMPAR is typically an
+ * inline function.
+ *
+ * COMPAR should take as its arguments two Datums, and return an int, in
+ * line with standard qsort convention.
+ *
+ * We have void* parameters for TYPE##comparetup_inline just to shut up the compiler.
+ * They could be SortTuple pointers instead, but that would make it more
+ * difficult to keep template_qsort_arg.h consistent with tuplesort.c.
+ */
+
+ #define DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, SPEC_VAR, REG_ADD_CODE, OUT_COMPAR) \
+ void TYPE##_##SPEC_VAR##_qsort_arg(void *a, size_t n, size_t es, void *arg); \
+ \
+ inline static int32 \
+ TYPE##SPEC_VAR##AppFunc(Datum datum1, bool isnull1, Datum datum2, bool isnull2, \
+ SortSupport sortKey) \
+ { \
+ int32 compare; \
+ if (isnull1) \
+ { \
+ if (isnull2) \
+ compare = 0; /* NULL "=" NULL */ \
+ else if (sortKey->ssup_nulls_first) \
+ compare = -1; /* NULL "<" NOT_NULL */ \
+ else \
+ compare = 1; /* NULL ">" NOT_NULL */ \
+ } \
+ else if (isnull2) \
+ { \
+ if (sortKey->ssup_nulls_first) \
+ compare = 1; /* NOT_NULL ">" NULL */ \
+ else \
+ compare = -1; /* NOT_NULL "<" NULL */ \
+ } \
+ else \
+ { \
+ compare = COMPAR(datum1, datum2); \
+ \
+ if (sortKey->ssup_reverse) \
+ compare = -compare; \
+ } \
+ return compare; \
+ } \
+ \
+ /* \
+ * Note that this is heavily based on comparetup_inline; the two should be kept \
+ * consistent. \
+ */ \
+ pg_always_inline static int \
+ TYPE##SPEC_VAR##comparetup_inline(const void *a, const void *b, \
+ Tuplesortstate *state) \
+ { \
+ const SortTuple* aT = a; \
+ const SortTuple* bT = b; \
+ int32 compare; \
+ SortSupport sortKey = state->sortKeys; \
+ \
+ /* Allow interrupting long sorts */ \
+ CHECK_FOR_INTERRUPTS(); \
+ compare = TYPE##SPEC_VAR##AppFunc(aT->datum1, aT->isnull1, bT->datum1, \
+ bT->isnull1, sortKey); \
+ if (compare != 0) \
+ return compare; \
+ /* Additional code for variants with more than one sortkey */ \
+ REG_ADD_CODE \
+ return 0; \
+ } \
+ \
+ \
+ inline static char * \
+ TYPE##SPEC_VAR##med3(char *a, char *b, char *c, void *arg) \
+ { \
+ return OUT_COMPAR(a, b, arg) < 0 ? \
+ (OUT_COMPAR(b, c, arg) < 0 ? \
+ b : (OUT_COMPAR(a, c, arg) < 0 ? c : a)) \
+ : (OUT_COMPAR(b, c, arg) > 0 ? \
+ b : (OUT_COMPAR(a, c, arg) < 0 ? a : c)); \
+ } \
+ \
+ void \
+ TYPE##_##SPEC_VAR##_qsort_arg(void *a, size_t n, size_t es, void *arg) \
+ { \
+ char *pa, \
+ *pb, \
+ *pc, \
+ *pd, \
+ *pl, \
+ *pm, \
+ *pn; \
+ int d, \
+ r, \
+ swaptype, \
+ presorted; \
+ \
+ loop:SWAPINIT(a, es); \
+ if (n < 7) \
+ { \
+ for (pm = (char *) a + es; pm < (char *) a + n * es; pm += es) \
+ for (pl = pm; pl > (char *) a && \
+ OUT_COMPAR(pl - es, pl, arg) > 0; \
+ pl -= es) \
+ thistype_swap(pl, pl - es); \
+ return; \
+ } \
+ presorted = 1; \
+ for (pm = (char *) a + es; pm < (char *) a + n * es; pm += es) \
+ { \
+ if (OUT_COMPAR(pm - es, pm, arg) > 0) \
+ { \
+ presorted = 0; \
+ break; \
+ } \
+ } \
+ if (presorted) \
+ return; \
+ pm = (char *) a + (n / 2) * es; \
+ if (n > 7) \
+ { \
+ pl = (char *) a; \
+ pn = (char *) a + (n - 1) * es; \
+ if (n > 40) \
+ { \
+ d = (n / 8) * es; \
+ pl = TYPE##SPEC_VAR##med3(pl, pl + d, pl + 2 * d, arg); \
+ pm = TYPE##SPEC_VAR##med3(pm - d, pm, pm + d, arg); \
+ pn = TYPE##SPEC_VAR##med3(pn - 2 * d, pn - d, pn, arg); \
+ } \
+ pm = TYPE##SPEC_VAR##med3(pl, pm, pn, arg); \
+ } \
+ thistype_swap(a, pm); \
+ pa = pb = (char *) a + es; \
+ pc = pd = (char *) a + (n - 1) * es; \
+ for (;;) \
+ { \
+ while (pb <= pc && \
+ (r = OUT_COMPAR(pb, a, arg)) <= 0) \
+ { \
+ if (r == 0) \
+ { \
+ thistype_swap(pa, pb); \
+ pa += es; \
+ } \
+ pb += es; \
+ } \
+ while (pb <= pc && \
+ (r = OUT_COMPAR(pc, a, arg)) >= 0) \
+ { \
+ if (r == 0) \
+ { \
+ thistype_swap(pc, pd); \
+ pd -= es; \
+ } \
+ pc -= es; \
+ } \
+ if (pb > pc) \
+ break; \
+ thistype_swap(pb, pc); \
+ pb += es; \
+ pc -= es; \
+ } \
+ pn = (char *) a + n * es; \
+ r = Min(pa - (char *) a, pb - pa); \
+ thistype_vecswap(a, pb - r, r); \
+ r = Min(pd - pc, pn - pd - es); \
+ thistype_vecswap(pb, pn - r, r); \
+ if ((r = pb - pa) > es) \
+ TYPE##_##SPEC_VAR##_qsort_arg(a, r / es, es, arg); \
+ if ((r = pd - pc) > es) \
+ { \
+ /* Iterate rather than recurse to save stack space */ \
+ a = pn - r; \
+ n = r / es; \
+ goto loop; \
+ } \
+ }
+
+ /*
+ * This code becomes part of the comparator meta-function for the "reg"
+ * specialization variant of each datatype-specific specialization.
+ *
+ * Note that this is heavily based on tuplesort_comparetup_heap; the two should
+ * be kept consistent.
+ *
+ * For "reg", we can handle multiple sortKeys, but the function generally
+ * will not be inlined directly when sorting. We'll try and use compile time
+ * knowledge of the comparator for later sortKeys, on the assumption that that
+ * will be a win frequently enough to justify the overhead.
+ *
+ * Modern compilers are not inclined to pay too much attention to the inline
+ * keyword, and indeed inline at the callsite granularity, rather than the
+ * function granularity. It should not be assumed that the second call to the
+ * "inline" comparator here will result in a second copy of the comparator.
+ */
+
+ #define REG_ADDITIONAL_HEAP_CODE(COMPAR) \
+ { \
+ HeapTupleData ltup; \
+ HeapTupleData rtup; \
+ TupleDesc tupDesc; \
+ int nkey; \
+ \
+ /* Compare additional sort keys */ \
+ ltup.t_len = \
+ ((MinimalTuple) aT->tuple)->t_len + MINIMAL_TUPLE_OFFSET; \
+ ltup.t_data = \
+ (HeapTupleHeader) ((char *) aT->tuple - MINIMAL_TUPLE_OFFSET); \
+ rtup.t_len = \
+ ((MinimalTuple) bT->tuple)->t_len + MINIMAL_TUPLE_OFFSET; \
+ rtup.t_data = \
+ (HeapTupleHeader) ((char *) bT->tuple - MINIMAL_TUPLE_OFFSET); \
+ tupDesc = state->tupDesc; \
+ sortKey++; \
+ for (nkey = 1; nkey < state->nKeys; nkey++, sortKey++) \
+ { \
+ AttrNumber attno = sortKey->ssup_attno; \
+ Datum datum1, \
+ datum2; \
+ bool isnull1, \
+ isnull2; \
+ \
+ datum1 = heap_getattr(<up, attno, tupDesc, &isnull1); \
+ datum2 = heap_getattr(&rtup, attno, tupDesc, &isnull2); \
+ \
+ if (sortKey->usable_compar == state->sortKeys->usable_compar) \
+ compare = COMPAR(datum1, isnull1, \
+ datum2, isnull2, \
+ sortKey); \
+ else \
+ compare = ApplySortComparator(datum1, isnull1, \
+ datum2, isnull2, \
+ sortKey); \
+ if (compare != 0) \
+ return compare; \
+ } \
+ }
+
+ /*
+ * REG_ADDITIONAL_BTREE_CODE() is analogous to REG_ADDITIONAL_HEAP_CODE(), and
+ * is present in btree multiple sortkey specializations.
+ *
+ * Note that this is heavily based on tuplesort_comparetup_btree; the two should
+ * be kept consistent.
+ */
+ #define REG_ADDITIONAL_BTREE_CODE(COMPAR, DEDUP_BTREE) \
+ { \
+ /* Compare additional sort keys */ \
+ IndexTuple tuple1 = (IndexTuple) aT->tuple; \
+ IndexTuple tuple2 = (IndexTuple) bT->tuple; \
+ int keysz = state->nKeys; \
+ int nkey; \
+ TupleDesc tupDes = RelationGetDescr(state->indexRel); \
+ bool equal_hasnull = false; \
+ sortKey++; \
+ for (nkey = 2; nkey <= keysz; nkey++, sortKey++) \
+ { \
+ Datum datum1, \
+ datum2; \
+ bool isnull1, \
+ isnull2; \
+ \
+ datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1); \
+ datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2); \
+ \
+ if (sortKey->usable_compar == state->sortKeys->usable_compar) \
+ compare = COMPAR(datum1, isnull1, \
+ datum2, isnull2, \
+ sortKey); \
+ else \
+ compare = ApplySortComparator(datum1, isnull1, \
+ datum2, isnull2, \
+ sortKey); \
+ if (compare != 0) \
+ return compare; /* done when we find unequal attributes */ \
+ \
+ /* they are equal, so we only need to examine one null flag */ \
+ if (isnull1) \
+ equal_hasnull = true; \
+ } \
+ /* \
+ * We may or may not have to enforce uniqueness in this \
+ * specialization \
+ */ \
+ DEDUP_BTREE \
+ }
+
+ #define INL_ADDITIONAL_CODE Assert(state->nKeys == 1);
+
+ #define DEDUP_BTREE \
+ { \
+ /* \
+ * btree has asked us to enforce uniqueness (otherwise this \
+ * specialization wouldn't be used), complain if two equal tuples \
+ * are detected (unless there was at least one NULL field). \
+ */ \
+ if (!equal_hasnull) \
+ { \
+ Datum values[INDEX_MAX_KEYS]; \
+ bool isnull[INDEX_MAX_KEYS]; \
+ \
+ Assert(tuple1 != tuple2); \
+ \
+ index_deform_tuple(tuple1, tupDes, values, isnull); \
+ ereport(ERROR, \
+ (errcode(ERRCODE_UNIQUE_VIOLATION), \
+ errmsg("could not create unique index \"%s\"", \
+ RelationGetRelationName(state->indexRel)), \
+ errdetail("Key %s is duplicated.", \
+ BuildIndexValueDescription(state->indexRel, \
+ values, isnull)))); \
+ } \
+ }
+ /* Suppress compiler warning */
+ #define NOOP_BTREE (void)equal_hasnull;
+
+ /*
+ * Manufacture inlining variant for nKeys=1 case, and non-inlining variant
+ * for nKeys > 1 case
+ */
+ #define TEMPLATE_QSORT_ARG_HEAP(TYPE, COMPAR) \
+ DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, inlheap, \
+ INL_ADDITIONAL_CODE, TYPE##inlheapcomparetup_inline) \
+ DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, regheap, \
+ REG_ADDITIONAL_HEAP_CODE(TYPE##regheapAppFunc), \
+ TYPE##regheapcomparetup_inline)
+
+ /*
+ * Btree sorting can use heap single key specializations equally well, so
+ * there's no point in generating single-key btree variants
+ *
+ * However, generate specializations that enforce uniqueness and others that
+ * do not
+ */
+ #define TEMPLATE_QSORT_ARG_BTREE(TYPE, COMPAR) \
+ DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, regbtree, \
+ REG_ADDITIONAL_BTREE_CODE(TYPE##regbtreeAppFunc, DEDUP_BTREE), \
+ TYPE##regbtreecomparetup_inline) \
+ DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, regbtreeuniq, \
+ REG_ADDITIONAL_BTREE_CODE(TYPE##regbtreeAppFunc, NOOP_BTREE ), \
+ TYPE##regbtreecomparetup_inline) \
+
diff --git a/src/port/qsort.c b/src/port/qsort.c
new file mode 100644
index 8e2c6d9..d1981d6
*** a/src/port/qsort.c
--- b/src/port/qsort.c
***************
*** 7,13 ****
* Remove ill-considered "swap_cnt" switch to insertion sort,
* in favor of a simple check for presorted input.
*
! * CAUTION: if you change this file, see also qsort_arg.c
*
* src/port/qsort.c
*/
--- 7,14 ----
* Remove ill-considered "swap_cnt" switch to insertion sort,
* in favor of a simple check for presorted input.
*
! * CAUTION: if you change this file, see also qsort_arg.c and
! * template_qsort_arg.h
*
* src/port/qsort.c
*/
diff --git a/src/port/qsort_arg.c b/src/port/qsort_arg.c
new file mode 100644
index 28d1894..0ab6198
*** a/src/port/qsort_arg.c
--- b/src/port/qsort_arg.c
***************
*** 7,13 ****
* Remove ill-considered "swap_cnt" switch to insertion sort,
* in favor of a simple check for presorted input.
*
! * CAUTION: if you change this file, see also qsort.c
*
* src/port/qsort_arg.c
*/
--- 7,13 ----
* Remove ill-considered "swap_cnt" switch to insertion sort,
* in favor of a simple check for presorted input.
*
! * CAUTION: if you change this file, see also qsort.c and template_qsort_arg.h
*
* src/port/qsort_arg.c
*/
On Thu, Dec 29, 2011 at 8:03 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
* A spreadsheet that shows the results of re-running my earlier heap
tuple sorting benchmark with this new patch. The improvement in the
query that orders by 2 columns is all that is pertinent there, when
considering the value of (1) and the sense in standing still for
controversy A.
* A spreadsheet that shows the difference in index creation times,
generated with the help of the new python script.
very nice. let me save everyone the effort of opening his
spreadsheets (which by the way both show 'HEAD/unoptimized' --
probably not what you meant): he's showing a consistent ~50% reduction
in running time of sort driven queries -- that's money.
merlin
On 30 December 2011 19:46, Merlin Moncure <mmoncure@gmail.com> wrote:
On Thu, Dec 29, 2011 at 8:03 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
* A spreadsheet that shows the results of re-running my earlier heap
tuple sorting benchmark with this new patch. The improvement in the
query that orders by 2 columns is all that is pertinent there, when
considering the value of (1) and the sense in standing still for
controversy A.
* A spreadsheet that shows the difference in index creation times,
generated with the help of the new python script.
very nice. let me save everyone the effort of opening his
spreadsheets (which by the way both show 'HEAD/unoptimized' --
probably not what you meant): he's showing a consistent ~50% reduction
in running time of sort driven queries -- that's money.
Sorry, I think you may have misinterpreted the results, which is my
fault - I introduced a formatting error. In the case of the "btree"
spreadsheet, the first query on each sheet should be "create index
test on orderlines (prod_id);", and not "select * from orderlines
order by prod_id". The idea is to compare the results from each set of
binaries across pages of the spreadsheet (note that there are two
tabs). You should not compare anything between the two spreadsheets.
Revised btree results attached. The heap results that I posted do not
have any formatting errors, so they have not been revised.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Attachments:
On Fri, Dec 30, 2011 at 2:30 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
On 30 December 2011 19:46, Merlin Moncure <mmoncure@gmail.com> wrote:
On Thu, Dec 29, 2011 at 8:03 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
* A spreadsheet that shows the results of re-running my earlier heap
tuple sorting benchmark with this new patch. The improvement in the
query that orders by 2 columns is all that is pertinent there, when
considering the value of (1) and the sense in standing still for
controversy A.
* A spreadsheet that shows the difference in index creation times,
generated with the help of the new python script.
very nice. let me save everyone the effort of opening his
spreadsheets (which by the way both show 'HEAD/unoptimized' --
probably not what you meant): he's showing a consistent ~50% reduction
in running time of sort driven queries -- that's money.
Sorry, I think you may have misinterpreted the results, which is my
fault - I introduced a formatting error. In the case of the "btree"
spreadsheet, the first query on each sheet should be "create index
test on orderlines (prod_id);", and not "select * from orderlines
order by prod_id". The idea is to compare the results from each set of
binaries across pages of the spreadsheet (note that there are two
tabs). You should not compare anything between the two spreadsheets.
Revised btree results attached. The heap results that I posted do not
have any formatting errors, so they have not been revised.
right-- my bad. still, that's 31-37% -- still pretty nice.
merlin
On Thu, Dec 29, 2011 at 9:03 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
3. Resolve two anticipated controversies that are, respectively,
somewhat orthogonal and completely orthogonal to the binary bloat
controversy. The first (controversy A) is that I have added a new
piece of infrastructure, pg_always_inline, which, as the name
suggests, is a portable way of insisting that a function should be
invariably inlined. Portable libraries like libc have been doing this
for a long time now, and I actually use the libc macro (that expands
to __attribute__((always_inline)) ) where possible.
I don't have a problem with the idea of a pg_always_inline, but I'm
wondering what sort of fallback mechanism you propose. It seems to me
that if the performance improvement is conditional on inlining being
done, and we're not confident that we can force the compiler to inline, we
ought to fall back to not making the specialization available, rather
than to making something available that may not work well. Of course
if it's only a question of a smaller performance gain, rather than an
actual loss, then that's not as much of an issue...
The second
possible point of contention (controversy B) is that I have jettisoned
various protections against bad qsort implementations that I believe
are a legacy of when we used the system qsort pre-2006, that can no
longer be justified. For example, quick sort performing badly in the
face of lots of duplicates is a well understood problem today
(partitioning should stop on equal keys), and ISTM that protecting
against that outside the qsort implementation (but only for index
sorts) is wrong-headed.
Assuming you mean that qsort_arg() will protect against this stuff so
that callers don't need to do it separately, +1.
I had another idea when writing this patch that I haven't developed at
all but I'll share anyway. That is, it might be a good idea to use a
balance quicksort:
I can't find any description of how this actually works... obviously,
we'd like to find a close-to-median element, but how do we actually do
that cheaply?
It might be possible to get a reasonable approximation of the actual
median value of a given column with existing statistics, which could
be hinted to qsort_arg.
I doubt that this would be worthwhile -- the pivot that we pick at the
toplevel doesn't really matter much; even if it's bad, we can recover
just fine if the pivot we pick one level down is good, or the level
after that. We only really have a problem if we keep systematically
picking bad pivots every time. And if we do have that problem,
improving the toplevel choice of pivot is only going to help modestly,
because we'll still be O(n^2) on each partition.
Can I get a straw poll on how much of a problem worst-case performance
of qsort is believed to be?
I'd be reluctant to remove any of the protections we currently have,
because I bet they were all put in to fix problems that people hit in
the field. But I don't know that we need more. Of course,
consolidating them into qsort() itself rather than its callers
probably makes sense, as noted above.
In a perfect world, if it were somehow possible to know the perfect
pivots ahead of time from a histogram or something, we'd have a quick
sort variant with worst-case performance of O(n log(n)). That, or the
limited approximation that I've sketched would perhaps be worthwhile,
even if it was subject to a number of caveats. Wikipedia claims that
the worst case for quick sort is O(n log(n)) with the refinements
recommended by Sedgewick's 1978 paper, but this seems like nonsense to
me - the med3 technique is a good heuristic in practice, but it's
perfectly possible in theory for it's sampling to always get things
completely wrong (this is not an unfamiliar problem).
I agree. After reading
http://nanxkurniawan.wordpress.com/2010/05/31/quicksort/ I am inclined
to think that things are still O(n lg n) even if we always partition
so that any fixed percentage of the data is on one side of the
partition element and the remainder is on the other side. So for
example if all of our partition elements are greater than 1% of the
elements in their partition and less than the other 99%, we'll still
be O(n lg n), though with a considerably higher constant factor, I
think. However, if our partition elements are always a fixed NUMBER
of elements from the end - whether it's 1 or 100 - then we'll be
O(n^2), though again the constant factor will depend on how many
elements you knock off each time. In practice I'm not sure whether an
algorithmic analysis matters much, because in real life the constant
factors are probably pretty important, hence the median-of-medians
approach with n > 40.
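To spell that argument out, here is a standard recurrence sketch (nothing specific to our qsort_arg; it just assumes partitioning costs linear time per level):

    fixed fraction:  T(n) = T(p*n) + T((1-p)*n) + c*n   =>  O(n log n)
                     (about log(n) levels, roughly c*n work per level)
    fixed number:    T(n) = T(n-k) + c*n                =>  O(n^2)
                     (about n/k levels, each still doing up to c*n work)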
While I'm thinking out loud, here's another idea: Have a GUC which,
when enabled (perhaps potentially at various different granularities),
makes Timsort the in-memory sorting algorithm, as it now is by default
for Python, Java SE7 (for arrays) and the Android platform; for
certain applications, this could be a real win, and I don't think that
it would have to be invasive: it could be dropped in.
I gather from a quick read that this is supposed to be especially good
when the data is already somewhat sorted. Would there be any merit in
trying to guess when that's likely to be the case (e.g. based on the
correlation statistic)? That seems like a stretch, but OTOH a GUC
doesn't feel great either: what is the poor user to do with a query
that does 2 sorts, one of which is faster with QS and the other of
which is faster with Timsort? Ugh.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Dec 29, 2011 at 9:03 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
The first (controversy A) is that I have added a new
piece of infrastructure, pg_always_inline, which, as the name
suggests, is a portable way of insisting that a function should be
invariably inlined.
I don't have a problem with the idea of a pg_always_inline, but I'm
wondering what sort of fallback mechanism you propose.
There is no compiler anywhere that implements "always inline", unless
you are talking about a macro. "inline" is a hint and nothing more,
and if you think you can force it you are mistaken. So this controversy
is easily resolved: we do not need any such construct.
The real question is whether we should accept a patch that is a
performance loss when the compiler fails to inline some reasonably
simple function. I think that would depend on the actual numbers
involved, so we'd need to see data before making a decision.
The second
possible point of contention (controversy B) is that I have jettisoned
various protections against bad qsort implementations that I believe
are a legacy of when we used the system qsort pre-2006, that can no
longer be justified.
No objection to that one, I think, as long as you've verified that our
implementation is in fact okay about these things.
regards, tom lane
On 5 January 2012 22:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
There is no compiler anywhere that implements "always inline", unless
you are talking about a macro. "inline" is a hint and nothing more,
and if you think you can force it you are mistaken. So this controversy
is easily resolved: we do not need any such construct.
I'm slightly puzzled by your remarks here. GCC documentation on this
is sparse (although, as I've demonstrated, I can get better numbers
using the always_inline attribute on GCC 4.3), but take a look at
this:
http://msdn.microsoft.com/en-us/library/z8y1yy88.aspx
While it is not strictly true to say that pg_always_inline could
inline every function under every set of circumstances, it's pretty
close to the truth. I do accept that this facility could quite easily
be abused if its use isn't carefully measured. I also accept that
C99/GNU C inline functions are generally just requests to the compiler
that may be ignored (and indeed the compiler may inline without being
asked to).
It's short sighted to see this as a case of inlining itself making a
big difference, so much as it making a big difference as an enabling
transformation.
The real question is whether we should accept a patch that is a
performance loss when the compiler fails to inline some reasonably
simple function. I think that would depend on the actual numbers
involved, so we'd need to see data before making a decision.
Who said anything about a performance loss? Since the raw improvement
to qsort_arg is so large, it's pretty difficult to imagine a
confluence of circumstances in which this results in a net loss. See
the figures at http://archives.postgresql.org/pgsql-hackers/2011-11/msg01316.php
, for example.
The pg_always_inline idea is relatively recent. It just serves to
provide additional encouragement to the compiler to inline, insofar as
that is possible on the platform in question. The compiler's
cost/benefit analysis cannot possibly appreciate how hot a code path
qsort_arg is, because it has a set of generic heuristics that are
quite fallible, and very probably err on the conservative side. We can
appreciate such things though.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Jan 5, 2012 at 5:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
There is no compiler anywhere that implements "always inline", unless
you are talking about a macro. "inline" is a hint and nothing more,
and if you think you can force it you are mistaken. So this controversy
is easily resolved: we do not need any such construct.
That's basically in direct contradiction to the experimental evidence.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 5 January 2012 20:23, Robert Haas <robertmhaas@gmail.com> wrote:
I don't have a problem with the idea of a pg_always_inline, but I'm
wondering what sort of fallback mechanism you propose. It seems to me
that if the performance improvement is conditioned on inlining being done
and we're not confident that we can force the compiler to inline, we
ought to fall back to not making the specialization available, rather
than to making something available that may not work well. Of course
if it's only a question of a smaller performance gain, rather than an
actual loss, then that's not as much of an issue...
pg_always_inline is more or less a neat adjunct to what I've already
done. You can take it or leave it, but it would be irrational to
dismiss it out of hand, because it is demonstrably effective in this
case. Using non-portable things like function attributes is fairly
well precedented - we use __attribute__((format(*) )) frequently. We
don't need a fallback mechanism other than "#else #define
pg_always_inline inline #endif". I think these facilities are
available to well over 99% of our users, who use GCC, GCC-compatible
compilers, or MSVC builds.
I have basically no sympathy for anyone who uses a compiler that can't
inline functions. Do such people actually exist?
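To make the shape of that fallback concrete, here is a minimal sketch along the lines described above; the platform tests and attribute spellings are illustrative only (the attached patch keys off compiler-provided __always_inline / __force_inline macros where available instead):

    #if defined(__GNUC__)
    #define pg_always_inline inline __attribute__((always_inline))
    #elif defined(_MSC_VER)
    #define pg_always_inline __forceinline
    #else
    #define pg_always_inline inline
    #endif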
ISTM that protecting
against that outside the qsort implementation (but only for index
sorts) is wrong-headed.
Assuming you mean that qsort_arg() will protect against this stuff so
that callers don't need to do it separately, +1.
That's exactly what I mean. Going quadratic in the face of a lot of
duplicates is something that, remarkably, was a problem well into the
1990s, apparently because some guy noticed it one day, though I think
that lots of popular implementations happened to be unaffected anyway.
As you know, all queries tested have lots and lots of duplicates (a
~1.5GB table that contains the same number of distinct elements as a
~10MB table once did), and removing the "duplicate protection" for
btrees, on top of everything else, sees the qsort do an awful lot
better than HEAD does with the pseudo-protection. As I've said, we
already lack any such "protection" for heap tuples. We are unaffected
now anyway, in particular, because we do this, as apparently
recommended by both Sedgewick and Knuth:
	while (pb <= pc && (r = cmp(pb, a)) <= 0)
	{
		if (r == 0)
		{
			swap(pa, pb);
			pa += es;
		}
		pb += es;
	}
Note that the partition swap occurs *because* the pivot and element
are equal. You might imagine that this is superfluous. It turns out
that it protects against the duplicates, resulting in sub-partitions
about the same size (though it occurs to me that it does so at the
expense of precluding parallelism). In short, Sedgewick would approve
of our qsort implementation.
As for the "compare a tuple to itself" thing, that's just silly, any
was probably only ever seen in some homespun implementation at some
point a relatively long time ago, but I've asserted against it anyway.
I can't find any description of how this actually works... obviously,
we'd like to find a close-to-median element, but how do we actually do
that cheaply?
Uh, it works whatever way you want it to work. The implementation has
to figure out how to get the median ahead of time.
I doubt that this would be worthwhile -- the pivot that we pick at the
toplevel doesn't really matter much; even if it's bad, we can recover
just fine if the pivot we pick one level down is good, or the level
after that. We only really have a problem if we keep systematically
picking bad pivots every time. And if we do have that problem,
improving the toplevel choice of pivot is only going to help modestly,
because we'll still be O(n^2) on each partition.
Can I get a straw poll on how much of a problem worst-case performance
of qsort is believed to be?
I'd be reluctant to remove any of the protections we currently have,
because I bet they were all put in to fix problems that people hit in
the field. But I don't know that we need more. Of course,
consolidating them into qsort() itself rather than its callers
probably makes sense, as noted above.
I was merely proposing something that would complement the med3 method
and our existing protections.
In a perfect world, if it were somehow possible to know the perfect
pivots ahead of time from a histogram or something, we'd have a quick
sort variant with worst-case performance of O(n log(n)). That, or the
limited approximation that I've sketched would perhaps be worthwhile,
even if it was subject to a number of caveats. Wikipedia claims that
the worst case for quick sort is O(n log(n)) with the refinements
recommended by Sedgewick's 1978 paper, but this seems like nonsense to
me - the med3 technique is a good heuristic in practice, but it's
perfectly possible in theory for its sampling to always get things
completely wrong (this is not an unfamiliar problem).
I agree. After reading
http://nanxkurniawan.wordpress.com/2010/05/31/quicksort/ I am inclined
to think that things are still O(n lg n) even if we always partition
so that any fixed percentage of the data is on one side of the
partition element and the remainder is on the other side. So for
example if all of our partition elements are greater than 1% of the
elements in their partition and less than the other 99%, we'll still
be O(n lg n), though with a considerably higher constant factor, I
think. However, if our partition elements are always a fixed NUMBER
of elements from the end - whether it's 1 or 100 - then we'll be
O(n^2), though again the constant factor will depend on how many
elements you knock off each time. In practice I'm not sure whether an
algorithmic analysis matters much, because in real life the constant
factors are probably pretty important, hence the median-of-medians
approach with n > 40.
Yeah, it's true that you have to be exceptionally unlucky to actually
see worst-case performance, but I imagined that it might be enough of
a difference to be worth pursuing. Experimentally, I can certainly see
the difference (I simulated it), but it is underwhelming next to
everything else.
Might be worth simplifying the swap logic, given that we only ever have
to swap SortTuple structs in qsort_arg, and given that it would make
the meta qsort_arg that much less ugly.
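For illustration only, this is the kind of specialised swap that could replace the generic byte-wise swapping in that case; SortTuple is the struct that tuplesort.c keeps in its memtuples array, and the helper name here is made up:

    static inline void
    swap_sorttuple(SortTuple *a, SortTuple *b)
    {
        SortTuple   tmp = *a;

        *a = *b;
        *b = tmp;
    }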
While I'm thinking out loud, here's another idea: Have a GUC which,
when enabled (perhaps potentially at various different granularities),
makes Timsort the in-memory sorting algorithm, as it now is by default
for Python, Java SE7 (for arrays) and the Android platform; for
certain applications, this could be a real win, and I don't think that
it would have to be invasive: it could be dropped in.
I gather from a quick read that this is supposed to be especially good
when the data is already somewhat sorted. Would there be any merit in
trying to guess when that's likely to be the case (e.g. based on the
correlation statistic)? That seems like a stretch, but OTOH a GUC
doesn't feel great either: what is the poor user to do with a query
that does 2 sorts, one of which is faster with QS and the other of
which is faster with Timsort? Ugh.
I'd imagined that it might work a bit like default_statistics_target.
Of course, I was just thinking out loud. Also, we do a very good job
on *perfectly* pre-sorted input because we check for pre-sorted input.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Fri, Jan 6, 2012 at 12:10 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
As you know, all queries tested have lots and lots of duplicates (a
~1.5GB table that contains the same number of distinct elements as a
~10MB table once did), and removing the "duplicate protection" for
btrees, on top of everything else, sees the qsort do an awful lot
better than HEAD does with the pseudo-protection.
Does that also win vs. unpatched master? If so we may as well pull
that part out and commit it separately.
I can't find any description of how this actually works... obviously,
we'd like to find a close-to-median element, but how do we actually do
that cheaply?
Uh, it works whatever way you want it to work. The implementation has
to figure out how to get the median ahead of time.
Oh. I'd be inclined to think that would be pretty inefficient, unless
we only did it for very large partitions or when we observed that some
less costly strategy was tanking.
I gather from a quick read that this is supposed to be especially good
when the data is already somewhat sorted. Would there be any merit in
trying to guess when that's likely to be the case (e.g. based on the
correlation statistic)? That seems like a stretch, but OTOH a GUC
doesn't feel great either: what is the poor user to do with a query
that does 2 sorts, one of which is faster with QS and the other of
which is faster with Timsort? Ugh.
I'd imagined that it might work a bit like default_statistics_target.
Of course, I was just thinking out loud. Also, we do a very good job
on *perfectly* pre-sorted input because we check for pre-sorted input.
We occasionally have people who want to do SELECT * FROM foo WHERE a >
100 AND a < 200 ORDER BY a, b where foo has an index on (a) but not
(a, b). This tends to suck, because we fail to realize that we can
batch the sort. We either seqscan and filter the table then sort the
results, or else we scan the index on (a) and then ignore the fact
that we have data which is already partially sorted. It's
particularly annoying if a is "mostly unique" and needs b only
occasionally to break ties. It would be nice to improve this, but it
may be more of a planner/executor problem that something we can fix
directly inside the sort implementation.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 6 January 2012 17:29, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jan 6, 2012 at 12:10 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
As you know, all queries tested have lots and lots of duplicates (a
~1.5GB table that contains the same number of distinct elements as a
~10MB table once did), and removing the "duplicate protection" for
btrees, on top of everything else, sees the qsort do an awful lot
better than HEAD does with the pseudo-protection.
Does that also win vs. unpatched master? If so we may as well pull
that part out and commit it separately.
I didn't bother isolating that, because it doesn't really make sense
to (not having it is probably only of particular value when doing what
I'm doing anyway, but who knows). Go ahead and commit something to
remove that code (get it in both comparetup_index_btree and
comparetup_index_hash), as well as the tuple1 != tuple2 test now if
you like. It's patently clear that it is unnecessary, and so doesn't
have to be justified as a performance enhancement - it's a simple case
of refactoring for clarity. As I've said, we don't do this for heap
tuples and we've heard no complaints in all that time. It was added in
commit fbac1272b89b547dbaacd78bbe8da68e5493cbda, presumably when
problems with system qsorts came to light.
I'd imagined that it might work a bit like default_statistics_target.
Of course, I was just thinking out loud. Also, we do a very good job
on *perfectly* pre-sorted input because we check for pre-sorted input.
We occasionally have people who want to do SELECT * FROM foo WHERE a >
100 AND a < 200 ORDER BY a, b where foo has an index on (a) but not
(a, b). This tends to suck, because we fail to realize that we can
batch the sort. We either seqscan and filter the table then sort the
results, or else we scan the index on (a) and then ignore the fact
that we have data which is already partially sorted. It's
particularly annoying if a is "mostly unique" and needs b only
occasionally to break ties. It would be nice to improve this, but it
may be more of a planner/executor problem that something we can fix
directly inside the sort implementation.
That sounds like an interesting problem. Something to look into for
9.3, perhaps.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Peter Geoghegan <peter@2ndquadrant.com> writes:
I didn't bother isolating that, because it doesn't really make sense
to (not having it is probably only of particular value when doing what
I'm doing anyway, but who knows). Go ahead and commit something to
remove that code (get it in both comparetup_index_btree and
comparetup_index_hash), as well as the tuple1 != tuple2 test now if
you like. It's patently clear that it is unnecessary, and so doesn't
have to be justified as a performance enhancement - it's a simple case
of refactoring for clarity. As I've said, we don't do this for heap
tuples and we've heard no complaints in all that time. It was added in
commit fbac1272b89b547dbaacd78bbe8da68e5493cbda, presumably when
problems with system qsorts came to light.
Actually, I'm going to object to reverting that commit, as I believe
that having equal-keyed index entries in physical table order may offer
some performance benefits at access time. If you don't like the
comment, we can change that.
regards, tom lane
On 6 January 2012 18:45, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter Geoghegan <peter@2ndquadrant.com> writes:
I didn't bother isolating that, because it doesn't really make sense
to (not having it is probably only of particular value when doing what
I'm doing anyway, but who knows). Go ahead and commit something to
remove that code (get it in both comparetup_index_btree and
comparetup_index_hash), as well as the tuple1 != tuple2 test now if
you like. It's patently clear that it is unnecessary, and so doesn't
have to be justified as a performance enhancement - it's a simple case
of refactoring for clarity. As I've said, we don't do this for heap
tuples and we've heard no complaints in all that time. It was added in
commit fbac1272b89b547dbaacd78bbe8da68e5493cbda, presumably when
problems with system qsorts came to light.
Actually, I'm going to object to reverting that commit, as I believe
that having equal-keyed index entries in physical table order may offer
some performance benefits at access time. If you don't like the
comment, we can change that.
Please explain your position. When is this supposed to be useful? Even
if you can present an argument for keeping it, it has to weigh the
fact that people don't tend to have much use for indexes with lots of
duplicates anyway. Prior to 2004, we did not do this - it was
justified as insurance against bad qsort implementations only.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Peter Geoghegan <peter@2ndquadrant.com> writes:
On 6 January 2012 18:45, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Actually, I'm going to object to reverting that commit, as I believe
that having equal-keyed index entries in physical table order may offer
some performance benefits at access time. If you don't like the
comment, we can change that.
Please explain your position. When is this supposed to be useful?
When there are lots of duplicates of a particular indexed value, the
existing code will cause an indexscan to search them in physical order,
whereas if we remove the existing logic it'll be random --- in
particular, accesses to the same heap page can no longer be expected to
be clustered.
Admittedly, I don't have any numbers quantifying just how useful that
might be, but on the other hand you've not presented any evidence
justifying removing the behavior either. If we believe your position
that indexes don't generally have lots of duplicates, then the code in
question will seldom be reached and therefore there would be no
performance benefit in removing it.
regards, tom lane
On Fri, Jan 6, 2012 at 4:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Admittedly, I don't have any numbers quantifying just how useful that
might be, but on the other hand you've not presented any evidence
justifying removing the behavior either. If we believe your position
that indexes don't generally have lots of duplicates, then the code in
question will seldom be reached and therefore there would be no
performance benefit in removing it.
Obviously, many indexes are unique and thus won't have duplicates at
all. But if someone creates an index and doesn't make it unique, odds
are very high that it has some duplicates. Not sure how many we
typically expect to see, but more than zero...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 6 January 2012 21:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:
When there are lots of duplicates of a particular indexed value, the
existing code will cause an indexscan to search them in physical order,
whereas if we remove the existing logic it'll be random --- in
particular, accesses to the same heap page can no longer be expected to
be clustered.
Isn't it possible to get them in physical order anyway, by reading
them into memory in that order? Efficient quick sort implementations
are not stable, and ours is no exception, but we could perhaps come up
with a cheaper tie-breaker value at that stage, if you're determined
to maintain this behaviour. We have sufficient incentive to, as I
describe below.
Admittedly, I don't have any numbers quantifying just how useful that
might be, but on the other hand you've not presented any evidence
justifying removing the behavior either. If we believe your position
that indexes don't generally have lots of duplicates, then the code in
question will seldom be reached and therefore there would be no
performance benefit in removing it.
I ran the same btree benchmark on master, but without the "cheap
insurance". The results were interesting, to say the least.
The columns indexed were the same columns and data that I've been
using all along. Initially this made sense, as the performance of
multi sort key sorts often largely hinged on being able to get away
with doing one comparison per pair of tuples - with many duplicates, I
could avoid cheating and show something closer to worst case for the
patch.
I didn't think it mattered that indexing the same columns would
produce what happened to be a not so useful index in the real world,
due to having so many duplicates - better to have figures that were
somewhat comparable for btree tuple sorting and heap tuple sorting.
When I ran the same benchmark on a server that differed from master
only in that there was no insurance, it momentarily appeared that
*all* of the gains for btree index creation came from being able to
elide the "cheap insurance", but only where it would have to be paid
for a high number of times.
I soon realised that I'd made a blunder: the code (that is, the patch
that I posted most recently) wasn't even using my specialisation for
qsorting, because the SortSupport pointer was null! I did not have
tuplesort_begin_index_btree initialise the SortSupport struct as
tuplesort_begin_heap did, so my earlier benchmark was effectively
meaningless, except that it indicated the benefits of eliding the
cheap insurance alone, if only for that not so compelling case. You
should note that the benefits of not paying for the insurance can be
very significant indeed.
Attached are figures for an identical run of the same btree python
script, but with a version of the patch that actually uses my
specialisations. Granted, these numbers are still partially predicated
on the index in question having a large number of duplicates, but it's
worth noting:
1. The gain from specialisation isn't bad; not as good as the
improvements we saw for heap tuples, but not so bad either, especially
considering that binary bloat should be much less controversial for
btree tuples.
2. The index that results from the tests is still useful; the planner
is perfectly willing to use it rather than performing an
in-memory sort. It will also use it to satisfy a query like "select *
from orderlines where prod_id = 5", albeit via a bitmap index scan. I
took the precaution of increasing default_statistics_target to its
maximum value, as well as performing an analyze on orderlines in
advance of checking this.
Revision to this patch that fixes the bug to follow - I produced these
new numbers from a rough cut of that.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Obviously, many indexes are unique and thus won't have duplicates at
all. But if someone creates an index and doesn't make it unique, odds
are very high that it has some duplicates. Not sure how many we
typically expect to see, but more than zero...
Peter may not, but I personally admin lots of databases which have
indexes on values like "category" or "city" which have 100's or 1000's
of duplicates per value. I don't think this is uncommon at all.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 9 January 2012 19:45, Josh Berkus <josh@agliodbs.com> wrote:
Obviously, many indexes are unique and thus won't have duplicates at
all. But if someone creates an index and doesn't make it unique, odds
are very high that it has some duplicates. Not sure how many we
typically expect to see, but more than zero...
Peter may not, but I personally admin lots of databases which have
indexes on values like "category" or "city" which have 100's or 1000's
of duplicates per value. I don't think this is uncommon at all.
Uh, then all the more reason to do what I recommend, I imagine. There
is most definitely a large overhead to creating such indexes, at least
for scalar types. As far as I can tell, Tom's complaint is quite
speculative.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On 6 January 2012 21:47, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jan 6, 2012 at 4:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Admittedly, I don't have any numbers quantifying just how useful that
might be, but on the other hand you've not presented any evidence
justifying removing the behavior either. If we believe your position
that indexes don't generally have lots of duplicates, then the code in
question will seldom be reached and therefore there would be no
performance benefit in removing it.
I have decided on a tactical retreat in relation to this patch. This
has been dragging on since September. I cannot risk not having the
patch accepted for 9.2, due to trying to handle both heap and btree
tuple sorting at once - the additional relatively modest improvements
for common cases that it will bring to btree index creation time do
not warrant digging my heels in to cover that case in one larger
commit. For that reason, I attach for your consideration a revision of
the patch without any support for btree index tuples whatsoever
(though I have adjusted btree tuplesort comments, and one tiny piece
of code, in a non-controversial way). I'll revisit btree sorting
towards the end of the commitfest, circumstances permitting.
As to the question of binary bloat, I have devised a pgbench-tools
workload that I think will go some way towards reassuring those who
are concerned about its distributed costs. The attached spreadsheet
has all the relevant details.
A custom script has been specified (details of which are evident from
the benchmark results themselves). If we experience some kind of
noticeable marginalisation of usefully cached instructions, that's
obviously where it'll show up.
Note that I have taken the preparatory step of updating some tables
with random data, rather than the sequential data that they have by
default, which, unusually for such large tables, can allow us to avail
of our pre-sorted input check optimisation, spoiling things. I did
vacuumdb immediately afterwards, immediately before the pgbench run in
each case.
I've kept the test at 8 clients/8 threads, on a machine with 4
physical cores + hyperthreading for 8 logical ("Intel Core(TM) i7 CPU
870 @ 2.93GHz", with 4 x 32 KB instruction caches, among several
other CPU caches) on a newly-initialised, throw-away database, single
run at scale 50 for 15 minutes in all cases. This is the same server
that I have used throughout.
My analysis is that although there may be a very slight regression in
non-affected queries (it's a tough call to make - how queries happen
to coincide undoubtedly muddies the waters, as does the fact that we
cut lots of time from periods in which backends hold a lot of locally
allocated sort memory - it might actually be a win for those other
queries sometimes but not others, and it appears that the margin
either way is very small), that is more than compensated for by the
benefit of the specialisations. The CPU cache is doing its job as
expected, and we clearly see a net benefit from its preference for
caching more instructions that are specialised - those sort
operations are way more expensive (if they weren't, it wouldn't cache
them so heavily). I also include results for running the same query
again and again with a single client, to put that effect in context
(though note that previous benchmarks avoided paying a high client
overhead by using explain analyze, and indeed originally compared
pre-SortSupport Postgres, so these numbers aren't as good either).
It's unusual for a database workload to be so heavily CPU bound, so
I'd suggest that the latter benchmark is more representative than the
former. Either way, we win by some margin. If the queries didn't have
such a high client overhead, as for example with a sort node that
feeds a merge join, we'd do better still.
If a sorting specialisation is never used anyway, the overhead, for
practical purposes, is zero. I believe that sorting specialisations
are just too useful to not be a net win in almost all reasonable
cases. Why haven't I used all specialisations at once, rather than
only two (one for single int4, the other multiple)? Well, I might have
used all of them, but there were no floats available in the pgbench
tables, and besides, the chances of all of them being simultaneously
in play during any sufficiently short period for marginalisation of
CPU cache contents to be of particular concern are, in general, not all
that great.
A major systems programming language, C++, produces multiple
specialisations as its standard library's sort function is used, for
example, each of which will have separate entries in the procedure
linkage table (though various implementation-defined techniques are
frequently used to reduce the number of copies across translation
units at link time). If the programmer specifies either a different
datatype, or a different comparator (through the use of an alternative
to the default std::less<T> functor/predicate), a whole new
specialisation is generated by the compiler. This does raise concerns
in relation to binary bloat, but they're reasonably well understood,
and this is the default way of doing things, for general purpose
application development. Now, I know that Postgres isn't written in
C++, and you might not find "C++ does it" to be a particularly
compelling argument, but it does go to show that these ideas are not
by any stretch of the imagination radical.
I remind everyone that the improvement seen in raw qsort_arg runtime
is fairly large, as it would have to be, in order to bring such large
though proportionally smaller improvements to each given affected
query as a whole - there are of course many other sources of overhead
involved in parsing, planning and executing affected queries.
Here is a complete detailing of the specialisations that this latest
revision produces:
int4, single key (supports date too)
int8, multiple keys (supports timestamps too, with and without TZ,
where we HAVE_INT64_TIMESTAMP)
float4, single key
float8, multiple keys
Type-generic single key specialisation. Expected to deliver some of
the same additional benefits for types that do not merit their own
specialisations but would still benefit, like name, int2, etc, but is
used all the time where applicable, even for types like text for which
it is expected that there will be no appreciable benefit.
Thoughts?
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Attachment: fastpath_sort_2012_01_19.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/access/nbtree/nbtcompare.c b/src/backend/access/nbtree/nbtcompare.c
new file mode 100644
index fedde93..128376d
*** a/src/backend/access/nbtree/nbtcompare.c
--- b/src/backend/access/nbtree/nbtcompare.c
***************
*** 49,55 ****
#include "postgres.h"
#include "utils/builtins.h"
- #include "utils/sortsupport.h"
Datum
--- 49,54 ----
*************** btint4cmp(PG_FUNCTION_ARGS)
*** 102,108 ****
PG_RETURN_INT32(-1);
}
! static int
btint4fastcmp(Datum x, Datum y, SortSupport ssup)
{
int32 a = DatumGetInt32(x);
--- 101,109 ----
PG_RETURN_INT32(-1);
}
! #ifndef USE_INLINE
!
! int
btint4fastcmp(Datum x, Datum y, SortSupport ssup)
{
int32 a = DatumGetInt32(x);
*************** btint4fastcmp(Datum x, Datum y, SortSupp
*** 116,121 ****
--- 117,124 ----
return -1;
}
+ #endif /* ! USE_INLINE */
+
Datum
btint4sortsupport(PG_FUNCTION_ARGS)
{
*************** btint8cmp(PG_FUNCTION_ARGS)
*** 139,145 ****
PG_RETURN_INT32(-1);
}
! static int
btint8fastcmp(Datum x, Datum y, SortSupport ssup)
{
int64 a = DatumGetInt64(x);
--- 142,150 ----
PG_RETURN_INT32(-1);
}
! #ifndef USE_INLINE
!
! int
btint8fastcmp(Datum x, Datum y, SortSupport ssup)
{
int64 a = DatumGetInt64(x);
*************** btint8fastcmp(Datum x, Datum y, SortSupp
*** 153,158 ****
--- 158,165 ----
return -1;
}
+ #endif /* ! USE_INLINE */
+
Datum
btint8sortsupport(PG_FUNCTION_ARGS)
{
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
new file mode 100644
index 08434dc..2265f3e
*** a/src/backend/utils/adt/float.c
--- b/src/backend/utils/adt/float.c
***************
*** 23,29 ****
#include "libpq/pqformat.h"
#include "utils/array.h"
#include "utils/builtins.h"
- #include "utils/sortsupport.h"
#ifndef M_PI
--- 23,28 ----
*************** do { \
*** 67,76 ****
/* Configurable GUC parameter */
int extra_float_digits = 0; /* Added to DBL_DIG or FLT_DIG */
-
- static int float4_cmp_internal(float4 a, float4 b);
- static int float8_cmp_internal(float8 a, float8 b);
-
#ifndef HAVE_CBRT
/*
* Some machines (in particular, some versions of AIX) have an extern
--- 66,71 ----
*************** float8div(PG_FUNCTION_ARGS)
*** 844,850 ****
/*
* float4{eq,ne,lt,le,gt,ge} - float4/float4 comparison operations
*/
! static int
float4_cmp_internal(float4 a, float4 b)
{
/*
--- 839,847 ----
/*
* float4{eq,ne,lt,le,gt,ge} - float4/float4 comparison operations
*/
! #ifndef USE_INLINE
!
! int
float4_cmp_internal(float4 a, float4 b)
{
/*
*************** float4_cmp_internal(float4 a, float4 b)
*** 874,879 ****
--- 871,878 ----
}
}
+ #endif /* ! USE_INLINE */
+
Datum
float4eq(PG_FUNCTION_ARGS)
{
*************** btfloat4cmp(PG_FUNCTION_ARGS)
*** 937,943 ****
PG_RETURN_INT32(float4_cmp_internal(arg1, arg2));
}
! static int
btfloat4fastcmp(Datum x, Datum y, SortSupport ssup)
{
float4 arg1 = DatumGetFloat4(x);
--- 936,945 ----
PG_RETURN_INT32(float4_cmp_internal(arg1, arg2));
}
!
! #ifndef USE_INLINE
!
! int
btfloat4fastcmp(Datum x, Datum y, SortSupport ssup)
{
float4 arg1 = DatumGetFloat4(x);
*************** btfloat4fastcmp(Datum x, Datum y, SortSu
*** 946,951 ****
--- 948,955 ----
return float4_cmp_internal(arg1, arg2);
}
+ #endif /* ! USE_INLINE */
+
Datum
btfloat4sortsupport(PG_FUNCTION_ARGS)
{
*************** btfloat4sortsupport(PG_FUNCTION_ARGS)
*** 958,964 ****
/*
* float8{eq,ne,lt,le,gt,ge} - float8/float8 comparison operations
*/
! static int
float8_cmp_internal(float8 a, float8 b)
{
/*
--- 962,971 ----
/*
* float8{eq,ne,lt,le,gt,ge} - float8/float8 comparison operations
*/
!
! #ifndef USE_INLINE
!
! int
float8_cmp_internal(float8 a, float8 b)
{
/*
*************** float8_cmp_internal(float8 a, float8 b)
*** 988,993 ****
--- 995,1002 ----
}
}
+ #endif /* ! USE_INLINE */
+
Datum
float8eq(PG_FUNCTION_ARGS)
{
*************** btfloat8cmp(PG_FUNCTION_ARGS)
*** 1051,1057 ****
PG_RETURN_INT32(float8_cmp_internal(arg1, arg2));
}
! static int
btfloat8fastcmp(Datum x, Datum y, SortSupport ssup)
{
float8 arg1 = DatumGetFloat8(x);
--- 1060,1068 ----
PG_RETURN_INT32(float8_cmp_internal(arg1, arg2));
}
! #ifndef USE_INLINE
!
! int
btfloat8fastcmp(Datum x, Datum y, SortSupport ssup)
{
float8 arg1 = DatumGetFloat8(x);
*************** btfloat8fastcmp(Datum x, Datum y, SortSu
*** 1060,1065 ****
--- 1071,1078 ----
return float8_cmp_internal(arg1, arg2);
}
+ #endif /* ! USE_INLINE */
+
Datum
btfloat8sortsupport(PG_FUNCTION_ARGS)
{
diff --git a/src/backend/utils/sort/sortsupport.c b/src/backend/utils/sort/sortsupport.c
new file mode 100644
index 7f388fd..39d41a2
*** a/src/backend/utils/sort/sortsupport.c
--- b/src/backend/utils/sort/sortsupport.c
***************
*** 16,21 ****
--- 16,22 ----
#include "postgres.h"
#include "fmgr.h"
+ #include "utils/fmgroids.h"
#include "utils/lsyscache.h"
#include "utils/sortsupport.h"
*************** PrepareSortSupportComparisonShim(Oid cmp
*** 127,132 ****
--- 128,181 ----
}
/*
+ * Given a pg_proc sort support function, see if it's safe to sort using a
+ * custom specialization with a hard-coded generic comparator for scalar types,
+ * to facilitate an optimization that may later be used when sorting.
+ *
+ * This is actually technically quite orthogonal to the user-exposed API for
+ * providing a directly callable comparator to elide the SQL function call
+ * machinery, though that API may separately improve performance for the same
+ * sort under certain circumstances, as when a sort must be performed with
+ * multiple keys of different types, only the first of which has a comparator
+ * representation that we consider here. However, it is reasonable to map
+ * built-in sort support functions to a corresponding hard coded comparator
+ * representation, because it is always expected that the types that this
+ * optimization will be used with will be a subset of types with a built-in sort
+ * support function - if it wasn't worth doing the former due to the fact that
+ * the innate cost of each comparison marginalised any benefit, it certainly
+ * won't be worth doing the latter. It is also reasonable because the fast
+ * comparator variants are, as a matter of policy, directly used for this
+ * optimization, so ipso facto a corresponding sort support function must be
+ * available.
+ */
+ TypeCompar
+ ResolveComparatorProper(Oid Proc)
+ {
+ switch (Proc)
+ {
+ /* Some comparison that has an underlying int4 representation */
+ case F_BTINT4SORTSUPPORT:
+ return TYPE_COMP_INT4;
+ case F_DATE_SORTSUPPORT:
+ return TYPE_COMP_INT4;
+ /* Some comparison that has an underlying int8 representation */
+ case F_BTINT8SORTSUPPORT:
+ return TYPE_COMP_INT8;
+ #ifdef HAVE_INT64_TIMESTAMP
+ case F_TIMESTAMP_SORTSUPPORT:
+ return TYPE_COMP_INT8;
+ #endif
+ /* floating point types */
+ case F_BTFLOAT4SORTSUPPORT:
+ return TYPE_COMP_FLOAT4;
+ case F_BTFLOAT8SORTSUPPORT:
+ return TYPE_COMP_FLOAT8;
+ default:
+ return TYPE_COMP_OTHER;
+ }
+ }
+
+ /*
* Fill in SortSupport given an ordering operator (btree "<" or ">" operator).
*
* Caller must previously have zeroed the SortSupportData structure and then
*************** PrepareSortSupportFromOrderingOp(Oid ord
*** 157,160 ****
--- 206,214 ----
/* We'll use a shim to call the old-style btree comparator */
PrepareSortSupportComparisonShim(sortFunction, ssup);
}
+ /*
+ * We may later avail of a further optimization for a few built-in scalar
+ * types: inlining of the comparator proper.
+ */
+ ssup->usable_compar = ResolveComparatorProper(sortFunction);
}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 4c2fe69..ea0164c
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
***************
*** 106,118 ****
--- 106,121 ----
#include "executor/executor.h"
#include "miscadmin.h"
#include "pg_trace.h"
+ #include "utils/builtins.h"
#include "utils/datum.h"
+ #include "utils/fmgroids.h"
#include "utils/logtape.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/pg_rusage.h"
#include "utils/rel.h"
#include "utils/sortsupport.h"
+ #include "utils/template_qsort_arg.h"
#include "utils/tuplesort.h"
*************** static void tuplesort_heap_insert(Tuples
*** 456,461 ****
--- 459,465 ----
static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex);
static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
static void markrunend(Tuplesortstate *state, int tapenum);
+ static inline int comparetup_inline(void *a, void *b, Tuplesortstate *state);
static int comparetup_heap(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
*************** static void readtup_datum(Tuplesortstate
*** 492,497 ****
--- 496,537 ----
static void reversedirection_datum(Tuplesortstate *state);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ /*
+ * Manufacture type specific sorting specialisations with inline comparators.
+ *
+ * Use fast comparator functions for "comparator proper" in each case. They're
+ * generally used via function pointers to elide SQL-function-call indirection
+ * when resolving the comparator of a btree index opclass for sorting
+ * (historically, tuplesort went through pg_proc in all cases) but that doesn't
+ * provide the additional benefits of inlining and other optimizations from
+ * compile-time comparator knowledge, through the generation of complete
+ * per-type sorting specializations.
+ */
+ #define int4_shim(x,y) btint4fastcmp(x, y, NULL)
+ #define int8_shim(x,y) btint8fastcmp(x, y, NULL)
+ #define float4_shim(x,y) btfloat4fastcmp(x, y, NULL)
+ #define float8_shim(x,y) btfloat8fastcmp(x, y, NULL)
+
+ #define noop_cmp_proper(nop1, nop2) 0
+ #define noop_code (void)0;
+
+ /* Instantiate sorting specializations (for heap tuples only) */
+ TEMPLATE_QSORT_ARG_HEAP(int4, int4_shim);
+ TEMPLATE_QSORT_ARG_HEAP(int8, int8_shim);
+ TEMPLATE_QSORT_ARG_HEAP(float4, float4_shim);
+ TEMPLATE_QSORT_ARG_HEAP(float8, float8_shim);
+
+ /*
+ * A type-neutral specialization, for single sort-key sorts not covered by other
+ * specializations. This specialization could be used for sorting both heap and
+ * index tuples that meet that criteria. However, it is currently only used for
+ * heap tuples.
+ *
+ * This could result in dead code, but since the redundant functions generated
+ * are all static, they will almost invariably be discarded by compiler
+ * optimizations performed by all modern compilers.
+ */
+ DO_TEMPLATE_QSORT_ARG(all, noop_cmp_proper, sing, noop_code, comparetup_inline);
/*
* tuplesort_begin_xxx
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 631,636 ****
--- 671,704 ----
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
+ if (state->sortKeys)
+ {
+ switch(state->sortKeys->usable_compar)
+ {
+ case TYPE_COMP_INT4:
+ state->sortKeys->qsort_arg_spec =
+ nkeys==1?int4_inlheap_qsort_arg:int4_regheap_qsort_arg;
+ break;
+ case TYPE_COMP_INT8:
+ state->sortKeys->qsort_arg_spec =
+ nkeys==1?int8_inlheap_qsort_arg:int8_regheap_qsort_arg;
+ break;
+ case TYPE_COMP_FLOAT4:
+ state->sortKeys->qsort_arg_spec =
+ nkeys==1?float4_inlheap_qsort_arg:float4_regheap_qsort_arg;
+ break;
+ case TYPE_COMP_FLOAT8:
+ state->sortKeys->qsort_arg_spec =
+ nkeys==1?float8_inlheap_qsort_arg:float8_regheap_qsort_arg;
+ break;
+ case TYPE_COMP_OTHER:
+ state->sortKeys->qsort_arg_spec =
+ nkeys==1?all_sing_qsort_arg:NULL;
+ break;
+ default:
+ elog(ERROR, "unrecognized comparator type for heap tuplesort");
+ }
+ }
MemoryContextSwitchTo(oldcontext);
return state;
*************** tuplesort_performsort(Tuplesortstate *st
*** 1221,1232 ****
* We were able to accumulate all the tuples within the allowed
* amount of memory. Just qsort 'em and we're done.
*/
if (state->memtupcount > 1)
! qsort_arg((void *) state->memtuples,
! state->memtupcount,
! sizeof(SortTuple),
! (qsort_arg_comparator) state->comparetup,
! (void *) state);
state->current = 0;
state->eof_reached = false;
state->markpos_offset = 0;
--- 1289,1317 ----
* We were able to accumulate all the tuples within the allowed
* amount of memory. Just qsort 'em and we're done.
*/
+
if (state->memtupcount > 1)
! {
! /* Use a sorting specialization if available */
! if (state->sortKeys && state->sortKeys->qsort_arg_spec)
! /* specialization available */
! state->sortKeys->qsort_arg_spec(
! (void *) state->memtuples,
! state->memtupcount,
! sizeof(SortTuple),
! (void *) state);
! else
! /*
! * Fall back on regular qsort_arg, with function pointer
! * comparator, making no compile-time assumptions about the
! * number of sortkeys or the datatype(s) to be sorted.
! */
! qsort_arg((void *) state->memtuples,
! state->memtupcount,
! sizeof(SortTuple),
! (qsort_arg_comparator) state->comparetup,
! (void *) state);
! }
state->current = 0;
state->eof_reached = false;
state->markpos_offset = 0;
*************** inlineApplySortFunction(FmgrInfo *sortFu
*** 2645,2653 ****
--- 2730,2766 ----
return compare;
}
+ /*
+ * This is a cut-down duplicate of comparetup_heap that exists for the express
+ * purpose of generating a sorting specialization in which it is inlined, as a
+ * performance optimization. It does not make any assumptions about the
+ * representation of the tuple proper (it could equally well be a heap or index
+ * tuple sort, though only heap tuples are currently supported).
+ *
+ * This is only possible for sorts with a single sortKey. Note that the
+ * comparator proper is not inlined, so this is only a partial specialization.
+ */
+ static inline int
+ comparetup_inline(void *a, void *b, Tuplesortstate *state)
+ {
+ /* void* parameters are used to shut-up the compiler */
+ const SortTuple *aT = (const SortTuple*) a, *bT = (const SortTuple*) b;
+ /* Allow interrupting long sorts */
+ CHECK_FOR_INTERRUPTS();
+
+ Assert(state->nKeys == 1);
+
+ /* Compare the one and only sort key with minimal indirection */
+ return ApplySortComparator(aT->datum1, aT->isnull1,
+ bT->datum1, bT->isnull1,
+ state->sortKeys);
+ }
/*
* Routines specialized for HeapTuple (actually MinimalTuple) case
+ *
+ * This code is partially duplicated within template_qsort_arg.h. Ensure that
+ * they are kept consistent.
*/
static int
*************** comparetup_index_btree(const SortTuple *
*** 2981,2990 ****
Tuplesortstate *state)
{
/*
! * This is similar to _bt_tuplecompare(), but we have already done the
! * index_getattr calls for the first column, and we need to keep track of
! * whether any null fields are present. Also see the special treatment
! * for equal keys at the end.
*/
ScanKey scanKey = state->indexScanKey;
IndexTuple tuple1;
--- 3094,3102 ----
Tuplesortstate *state)
{
/*
! * We have already done the index_getattr calls for the first column, and we
! * need to keep track of whether any null fields are present. Also see the
! * special treatment for equal keys at the end.
*/
ScanKey scanKey = state->indexScanKey;
IndexTuple tuple1;
*************** comparetup_index_btree(const SortTuple *
*** 3046,3063 ****
* they *must* get compared at some stage of the sort --- otherwise the
* sort algorithm wouldn't have checked whether one must appear before the
* other.
- *
- * Some rather brain-dead implementations of qsort will sometimes call the
- * comparison routine to compare a value to itself. (At this writing only
- * QNX 4 is known to do such silly things; we don't support QNX anymore,
- * but perhaps the behavior still exists elsewhere.) Don't raise a bogus
- * error in that case.
*/
! if (state->enforceUnique && !equal_hasnull && tuple1 != tuple2)
{
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
index_deform_tuple(tuple1, tupDes, values, isnull);
ereport(ERROR,
(errcode(ERRCODE_UNIQUE_VIOLATION),
--- 3158,3178 ----
* they *must* get compared at some stage of the sort --- otherwise the
* sort algorithm wouldn't have checked whether one must appear before the
* other.
*/
! if (state->enforceUnique && !equal_hasnull)
{
Datum values[INDEX_MAX_KEYS];
bool isnull[INDEX_MAX_KEYS];
+ /*
+ * In the past, it was conceivable that we'd have to protect against
+ * a comparison of a tuple to itself, because we used the system
+ * qsort(), and as such had to assume the worst about the implementation's
+ * bogosity. This is no longer the case, though this assertion serves to
+ * prevent that problem from re-emerging.
+ */
+ Assert(tuple1 != tuple2);
+
index_deform_tuple(tuple1, tupDes, values, isnull);
ereport(ERROR,
(errcode(ERRCODE_UNIQUE_VIOLATION),
*************** comparetup_index_btree(const SortTuple *
*** 3070,3078 ****
/*
* If key values are equal, we sort on ItemPointer. This does not affect
! * validity of the finished index, but it offers cheap insurance against
! * performance problems with bad qsort implementations that have trouble
! * with large numbers of equal keys.
*/
{
BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
--- 3185,3192 ----
/*
* If key values are equal, we sort on ItemPointer. This does not affect
! * validity of the finished index, but it may be useful to have index scans
! * in physical order.
*/
{
BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
*************** comparetup_index_hash(const SortTuple *a
*** 3120,3128 ****
/*
* If hash values are equal, we sort on ItemPointer. This does not affect
! * validity of the finished index, but it offers cheap insurance against
! * performance problems with bad qsort implementations that have trouble
! * with large numbers of equal keys.
*/
tuple1 = (IndexTuple) a->tuple;
tuple2 = (IndexTuple) b->tuple;
--- 3234,3241 ----
/*
* If hash values are equal, we sort on ItemPointer. This does not affect
! * validity of the finished index, but it may be useful to have index scans
! * in physical order.
*/
tuple1 = (IndexTuple) a->tuple;
tuple2 = (IndexTuple) b->tuple;
diff --git a/src/include/c.h b/src/include/c.h
new file mode 100644
index 7396adb..939f49a
*** a/src/include/c.h
--- b/src/include/c.h
*************** extern int fdatasync(int fildes);
*** 850,853 ****
--- 850,865 ----
/* /port compatibility functions */
#include "port.h"
+ /*
+ * Define a cross-platform "always-inline" macro. This is a very sharp tool that
+ * should be used judiciously.
+ */
+ #ifdef __always_inline
+ #define pg_always_inline __always_inline
+ #elif defined(__force_inline)
+ #define pg_always_inline __force_inline
+ #else
+ #define pg_always_inline inline
+ #endif
+
#endif /* C_H */
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
new file mode 100644
index 46b2f3b..c8c678c
*** a/src/include/utils/builtins.h
--- b/src/include/utils/builtins.h
***************
*** 14,21 ****
--- 14,24 ----
#ifndef BUILTINS_H
#define BUILTINS_H
+ #include <math.h>
+
#include "fmgr.h"
#include "nodes/parsenodes.h"
+ #include "utils/sortsupport.h"
/*
* Defined in adt/
*************** extern Datum btfloat8sortsupport(PG_FUNC
*** 323,328 ****
--- 326,459 ----
extern Datum btoidsortsupport(PG_FUNCTION_ARGS);
extern Datum btnamesortsupport(PG_FUNCTION_ARGS);
+ /*
+ * Expose some "fast" variants of per-opclass comparison functions that
+ * will be returned by some of the above sort-support functions. Do so to
+ * avail of optimizations from compile-time knowledge of comparators,
+ * including but not limited to inlining, both for its innate value in
+ * improving locality of reference, and its value as an enabling
+ * transformation.
+ */
+
+ #ifdef USE_INLINE
+
+ static inline int
+ btint4fastcmp(Datum x, Datum y, SortSupport ssup)
+ {
+ int32 a = DatumGetInt32(x);
+ int32 b = DatumGetInt32(y);
+
+ if (a > b)
+ return 1;
+ else if (a == b)
+ return 0;
+ else
+ return -1;
+ }
+
+ static inline int
+ btint8fastcmp(Datum x, Datum y, SortSupport ssup)
+ {
+ int64 a = DatumGetInt64(x);
+ int64 b = DatumGetInt64(y);
+
+ if (a > b)
+ return 1;
+ else if (a == b)
+ return 0;
+ else
+ return -1;
+ }
+
+ static inline int
+ float4_cmp_internal(float4 a, float4 b)
+ {
+ /*
+ * We consider all NANs to be equal and larger than any non-NAN. This is
+ * somewhat arbitrary; the important thing is to have a consistent sort
+ * order.
+ */
+ if (isnan(a))
+ {
+ if (isnan(b))
+ return 0; /* NAN = NAN */
+ else
+ return 1; /* NAN > non-NAN */
+ }
+ else if (isnan(b))
+ {
+ return -1; /* non-NAN < NAN */
+ }
+ else
+ {
+ if (a > b)
+ return 1;
+ else if (a < b)
+ return -1;
+ else
+ return 0;
+ }
+ }
+
+ static inline int
+ float8_cmp_internal(float8 a, float8 b)
+ {
+ /*
+ * We consider all NANs to be equal and larger than any non-NAN. This is
+ * somewhat arbitrary; the important thing is to have a consistent sort
+ * order.
+ */
+ if (isnan(a))
+ {
+ if (isnan(b))
+ return 0; /* NAN = NAN */
+ else
+ return 1; /* NAN > non-NAN */
+ }
+ else if (isnan(b))
+ {
+ return -1; /* non-NAN < NAN */
+ }
+ else
+ {
+ if (a > b)
+ return 1;
+ else if (a < b)
+ return -1;
+ else
+ return 0;
+ }
+ }
+
+ static inline int
+ btfloat4fastcmp(Datum x, Datum y, SortSupport ssup)
+ {
+ float4 arg1 = DatumGetFloat4(x);
+ float4 arg2 = DatumGetFloat4(y);
+
+ return float4_cmp_internal(arg1, arg2);
+ }
+
+ static inline int
+ btfloat8fastcmp(Datum x, Datum y, SortSupport ssup)
+ {
+ float8 arg1 = DatumGetFloat8(x);
+ float8 arg2 = DatumGetFloat8(y);
+
+ return float8_cmp_internal(arg1, arg2);
+ }
+
+ #else
+
+ extern int btint4fastcmp(Datum x, Datum y, SortSupport ssup);
+ extern int btint8fastcmp(Datum x, Datum y, SortSupport ssup);
+ extern int float4_cmp_internal(float4 a, float4 b);
+ extern int float8_cmp_internal(float8 a, float8 b);
+ extern int btfloat4fastcmp(Datum x, Datum y, SortSupport ssup);
+ extern int btfloat8fastcmp(Datum x, Datum y, SortSupport ssup);
+
+ #endif /* USE_INLINE */
+
/* float.c */
extern PGDLLIMPORT int extra_float_digits;
diff --git a/src/include/utils/sortsupport.h b/src/include/utils/sortsupport.h
new file mode 100644
index ef8d853..822ee6e
*** a/src/include/utils/sortsupport.h
--- b/src/include/utils/sortsupport.h
***************
*** 31,36 ****
--- 31,42 ----
* data can be stored using the ssup_extra field. Any such data
* should be allocated in the ssup_cxt memory context.
*
+ * In addition to providing lower-overhead comparators, this infrastructure also
+ * resolves a generic comparator for the type, if one is available from a static
+ * set, each of which corresponds to one or more types. These are available
+ * only for a subset of built-in sort support functions. This enables more
+ * aggressive per-type optimizations at sort time.
+ *
* Note: since pg_amproc functions are indexed by (lefttype, righttype)
* it is possible to associate a BTSORTSUPPORT function with a cross-type
* comparison. This could sensibly be used to provide a fast comparator
***************
*** 49,54 ****
--- 55,87 ----
#include "access/attnum.h"
+ /*
+ * Which comparator representation can be used for this type? It is acceptable
+ * to use an int4 comparator with the date datatype for example, because both
+ * types have the same underlying representation. In particular, their
+ * comparators are interchangeable.
+ *
+ * It is incorrect to do this with types that share a certain bitwise
+ * representation with some scalar type, represented by an enum constant here,
+ * when they do not have interchangeable comparators. For example, sortkeys of
+ * legacy float8 representation of timestamps will not be set to
+ * TYPE_COMP_FLOAT8, because its comparator has special handling of NaN values.
+ *
+ * To an even greater extent than with the optimization whereby a btree index
+ * opclass provides a "fast" comparator to elide SQL function call overhead,
+ * it's not useful to do this with types that have inherently high-overhead
+ * comparisons, of greater than a few instructions; the cost of the comparison
+ * itself is expected to dominate, marginalizing any benefit.
+ */
+ typedef enum TypeCompar
+ {
+ TYPE_COMP_INT4,
+ TYPE_COMP_INT8,
+ TYPE_COMP_FLOAT4,
+ TYPE_COMP_FLOAT8,
+ TYPE_COMP_OTHER
+ } TypeCompar;
+
typedef struct SortSupportData *SortSupport;
typedef struct SortSupportData
*************** typedef struct SortSupportData
*** 96,103 ****
int (*comparator) (Datum x, Datum y, SortSupport ssup);
/*
! * Additional sort-acceleration functions might be added here later.
*/
} SortSupportData;
--- 129,157 ----
int (*comparator) (Datum x, Datum y, SortSupport ssup);
/*
! * This specialization function pointer is sometimes used as an alternative
! * to the standard qsort_arg, when it has been determined that we can
! * benefit from various per-type and per number of sort key performance
! * optimizations.
! *
! * Often, this will simply entail using a variant that inlines the comparator
! * that was previously used (through a pointer) by the highly generic
! * qsort_arg, such as comparetup_heap, which encapsulate the details of
! * performing a comparison on the class of tuple in question, and are where
! * the call is made to MySortSupport->comparator.
! *
! * In some other cases, particularly with scalar datatypes that are assumed
! * to be sorted far more frequently in practice, the specialization goes so
! * far as to inline the comparator proper from within the tuple-class
! * encapsulating comparator.
! */
! void (*qsort_arg_spec)(void *a, size_t n, size_t es, void *arg);
!
! /*
! * Which type-specific comparator proper (which is part of a fully inlined
! * specialization) can be safely used, if any?
*/
+ TypeCompar usable_compar;
} SortSupportData;
*************** extern int ApplySortComparator(Datum dat
*** 151,156 ****
--- 205,211 ----
/* Other functions in utils/sort/sortsupport.c */
extern void PrepareSortSupportComparisonShim(Oid cmpFunc, SortSupport ssup);
+ extern TypeCompar ResolveComparatorProper(Oid orderingOp);
extern void PrepareSortSupportFromOrderingOp(Oid orderingOp, SortSupport ssup);
#endif /* SORTSUPPORT_H */
diff --git a/src/include/utils/template_qsort_arg.h b/src/include/utils/template_qsort_arg.h
new file mode 100644
index ...6e9ef83
*** a/src/include/utils/template_qsort_arg.h
--- b/src/include/utils/template_qsort_arg.h
***************
*** 0 ****
--- 1,316 ----
+ /*-------------------------------------------------------------------------
+ * template_qsort_arg.h: "template" version of qsort_arg.c
+ *
+ * This version of qsort_arg is exclusively used within tuplesort.c to more
+ * efficiently sort common types such as integers and floats. In providing this
+ * version, we seek to take advantage of compile-time optimizations, in
+ * particular, the inlining of comparators. In some cases the "tuple class"
+ * encapsulating comparator (that is, the particular comparator used directly
+ * here) is just inlined, when a full specialization is unavailable. In other
+ * cases, the comparator proper is also inlined, for full integration of the
+ * sort routine. There are variant specializations for cases with only a single
+ * sortkey, and cases with multiple sortkeys.
+ *
+ * The TEMPLATE_QSORT_ARG() macro generates an inlining variant (for sorts with
+ * a single sortKey) and non-inlining variant for sorts with multiple sortKeys.
+ *
+ * We rely on various function declarations, and indeed partially duplicate
+ * some code from tuplesort.c, so this file should be considered private to
+ * that module, rather than a generic piece of infrastructure.
+ *
+ * CAUTION: if you change this file, see also qsort_arg.c as well as qsort.c.
+ * qsort_arg.c should be considered authoritative.
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/template_qsort_arg.h
+ *-------------------------------------------------------------------------
+ */
+ #include "c.h"
+
+ #define swapcode(TYPE, parmi, parmj, n) \
+ do { \
+ size_t i = (n) / sizeof (TYPE); \
+ TYPE *pi = (TYPE *)(void *)(parmi); \
+ TYPE *pj = (TYPE *)(void *)(parmj); \
+ do { \
+ TYPE t = *pi; \
+ *pi++ = *pj; \
+ *pj++ = t; \
+ } while (--i > 0); \
+ } while (0)
+
+ #define SWAPINIT(a, es) swaptype = ((char *)(a) - (char *)0) % sizeof(long) || \
+ (es) % sizeof(long) ? 2 : (es) == sizeof(long)? 0 : 1;
+
+ #define thistype_vecswap(a, b, n) \
+ if ((n) > 0) inl_swapfunc((a), (b), (size_t)(n), swaptype)
+
+ #define thistype_swap(a, b) \
+ if (swaptype == 0) { \
+ long t = *(long *)(void *)(a); \
+ *(long *)(void *)(a) = *(long *)(void *)(b); \
+ *(long *)(void *)(b) = t; \
+ } else \
+ inl_swapfunc(a, b, es, swaptype)
+
+ inline static void
+ inl_swapfunc(char *a, char *b, size_t n, int swaptype)
+ {
+ if (swaptype <= 1)
+ swapcode(long, a, b, n);
+ else
+ swapcode(char, a, b, n);
+ }
+
+ /*
+ * This macro manufactures a type-specific implementation of qsort_arg with
+ * the comparator, COMPAR, known at compile time. COMPAR is typically an
+ * inline function.
+ *
+ * COMPAR should take as its arguments two Datums, and return an int, in
+ * line with standard qsort convention.
+ *
+ * We have void* parameters for TYPE##comparetup_inline just to shut up the compiler.
+ * They could be SortTuple pointers instead, but that would make it more
+ * difficult to keep template_qsort_arg.h consistent with tuplesort.c.
+ */
+
+ #define DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, SPEC_VAR, ADD_CODE, OUT_COMPAR) \
+ void TYPE##_##SPEC_VAR##_qsort_arg(void *a, size_t n, size_t es, void *arg); \
+ \
+ inline static int32 \
+ TYPE##SPEC_VAR##AppFunc(Datum datum1, bool isnull1, Datum datum2, bool isnull2, \
+ SortSupport sortKey) \
+ { \
+ int32 compare; \
+ if (isnull1) \
+ { \
+ if (isnull2) \
+ compare = 0; /* NULL "=" NULL */ \
+ else if (sortKey->ssup_nulls_first) \
+ compare = -1; /* NULL "<" NOT_NULL */ \
+ else \
+ compare = 1; /* NULL ">" NOT_NULL */ \
+ } \
+ else if (isnull2) \
+ { \
+ if (sortKey->ssup_nulls_first) \
+ compare = 1; /* NOT_NULL ">" NULL */ \
+ else \
+ compare = -1; /* NOT_NULL "<" NULL */ \
+ } \
+ else \
+ { \
+ compare = COMPAR(datum1, datum2); \
+ \
+ if (sortKey->ssup_reverse) \
+ compare = -compare; \
+ } \
+ return compare; \
+ } \
+ \
+ /* \
+ * Note that this is heavily based on comparetup_inline; the two should be kept \
+ * consistent. \
+ */ \
+ pg_always_inline static int \
+ TYPE##SPEC_VAR##comparetup_inline(const void *a, const void *b, \
+ Tuplesortstate *state) \
+ { \
+ const SortTuple* aT = a; \
+ const SortTuple* bT = b; \
+ int32 compare; \
+ SortSupport sortKey = state->sortKeys; \
+ \
+ /* Allow interrupting long sorts */ \
+ CHECK_FOR_INTERRUPTS(); \
+ compare = TYPE##SPEC_VAR##AppFunc(aT->datum1, aT->isnull1, bT->datum1, \
+ bT->isnull1, sortKey); \
+ if (compare != 0) \
+ return compare; \
+ /* Additional code for variants with more than one sortkey */ \
+ ADD_CODE \
+ return 0; \
+ } \
+ \
+ inline static char * \
+ TYPE##SPEC_VAR##med3(char *a, char *b, char *c, void *arg) \
+ { \
+ return OUT_COMPAR(a, b, arg) < 0 ? \
+ (OUT_COMPAR(b, c, arg) < 0 ? \
+ b : (OUT_COMPAR(a, c, arg) < 0 ? c : a)) \
+ : (OUT_COMPAR(b, c, arg) > 0 ? \
+ b : (OUT_COMPAR(a, c, arg) < 0 ? a : c)); \
+ } \
+ \
+ void \
+ TYPE##_##SPEC_VAR##_qsort_arg(void *a, size_t n, size_t es, void *arg) \
+ { \
+ char *pa, \
+ *pb, \
+ *pc, \
+ *pd, \
+ *pl, \
+ *pm, \
+ *pn; \
+ int d, \
+ r, \
+ swaptype, \
+ presorted; \
+ \
+ loop:SWAPINIT(a, es); \
+ if (n < 7) \
+ { \
+ for (pm = (char *) a + es; pm < (char *) a + n * es; pm += es) \
+ for (pl = pm; pl > (char *) a && \
+ OUT_COMPAR(pl - es, pl, arg) > 0; \
+ pl -= es) \
+ thistype_swap(pl, pl - es); \
+ return; \
+ } \
+ presorted = 1; \
+ for (pm = (char *) a + es; pm < (char *) a + n * es; pm += es) \
+ { \
+ if (OUT_COMPAR(pm - es, pm, arg) > 0) \
+ { \
+ presorted = 0; \
+ break; \
+ } \
+ } \
+ if (presorted) \
+ return; \
+ pm = (char *) a + (n / 2) * es; \
+ if (n > 7) \
+ { \
+ pl = (char *) a; \
+ pn = (char *) a + (n - 1) * es; \
+ if (n > 40) \
+ { \
+ d = (n / 8) * es; \
+ pl = TYPE##SPEC_VAR##med3(pl, pl + d, pl + 2 * d, arg); \
+ pm = TYPE##SPEC_VAR##med3(pm - d, pm, pm + d, arg); \
+ pn = TYPE##SPEC_VAR##med3(pn - 2 * d, pn - d, pn, arg); \
+ } \
+ pm = TYPE##SPEC_VAR##med3(pl, pm, pn, arg); \
+ } \
+ thistype_swap(a, pm); \
+ pa = pb = (char *) a + es; \
+ pc = pd = (char *) a + (n - 1) * es; \
+ for (;;) \
+ { \
+ while (pb <= pc && \
+ (r = OUT_COMPAR(pb, a, arg)) <= 0) \
+ { \
+ if (r == 0) \
+ { \
+ thistype_swap(pa, pb); \
+ pa += es; \
+ } \
+ pb += es; \
+ } \
+ while (pb <= pc && \
+ (r = OUT_COMPAR(pc, a, arg)) >= 0) \
+ { \
+ if (r == 0) \
+ { \
+ thistype_swap(pc, pd); \
+ pd -= es; \
+ } \
+ pc -= es; \
+ } \
+ if (pb > pc) \
+ break; \
+ thistype_swap(pb, pc); \
+ pb += es; \
+ pc -= es; \
+ } \
+ pn = (char *) a + n * es; \
+ r = Min(pa - (char *) a, pb - pa); \
+ thistype_vecswap(a, pb - r, r); \
+ r = Min(pd - pc, pn - pd - es); \
+ thistype_vecswap(pb, pn - r, r); \
+ if ((r = pb - pa) > es) \
+ TYPE##_##SPEC_VAR##_qsort_arg(a, r / es, es, arg); \
+ if ((r = pd - pc) > es) \
+ { \
+ /* Iterate rather than recurse to save stack space */ \
+ a = pn - r; \
+ n = r / es; \
+ goto loop; \
+ } \
+ }
+
+ /*
+ * This code becomes part of the comparator meta-function for the "reg"
+ * specialization variant of each datatype-specific specialization.
+ *
+ * Note that this is heavily based on tuplesort_comparetup_heap; the two should
+ * be kept consistent.
+ *
+ * We can handle multiple sortKeys, but the function generally will not be
+ * inlined directly when sorting. We'll try and use compile time knowledge of
+ * the comparator for later sortKeys, on the basis that that has been determined
+ * to be a win frequently enough to justify the overhead.
+ *
+ * Modern compilers are not inclined to pay too much attention to the inline
+ * keyword, and indeed inline at the callsite granularity, rather than the
+ * function granularity. It should not be assumed that the second call to the
+ * "inline" comparator here will result in a second copy of the comparator.
+ */
+
+ #define MULT_ADDITIONAL_CODE(COMPAR) \
+ { \
+ HeapTupleData ltup; \
+ HeapTupleData rtup; \
+ TupleDesc tupDesc; \
+ int nkey; \
+ \
+ /* Compare additional sort keys */ \
+ ltup.t_len = \
+ ((MinimalTuple) aT->tuple)->t_len + MINIMAL_TUPLE_OFFSET; \
+ ltup.t_data = \
+ (HeapTupleHeader) ((char *) aT->tuple - MINIMAL_TUPLE_OFFSET); \
+ rtup.t_len = \
+ ((MinimalTuple) bT->tuple)->t_len + MINIMAL_TUPLE_OFFSET; \
+ rtup.t_data = \
+ (HeapTupleHeader) ((char *) bT->tuple - MINIMAL_TUPLE_OFFSET); \
+ tupDesc = state->tupDesc; \
+ sortKey++; \
+ for (nkey = 1; nkey < state->nKeys; nkey++, sortKey++) \
+ { \
+ AttrNumber attno = sortKey->ssup_attno; \
+ Datum datum1, \
+ datum2; \
+ bool isnull1, \
+ isnull2; \
+ \
+ datum1 = heap_getattr(&ltup, attno, tupDesc, &isnull1); \
+ datum2 = heap_getattr(&rtup, attno, tupDesc, &isnull2); \
+ \
+ if (sortKey->usable_compar == state->sortKeys->usable_compar) \
+ compare = COMPAR(datum1, isnull1, \
+ datum2, isnull2, \
+ sortKey); \
+ else \
+ compare = ApplySortComparator(datum1, isnull1, \
+ datum2, isnull2, \
+ sortKey); \
+ if (compare != 0) \
+ return compare; \
+ } \
+ }
+
+ #define SING_ADDITIONAL_CODE Assert(state->nKeys == 1);
+
+ /*
+ * Manufacture inlining variant for nKeys=1 case, and non-inlining variant
+ * for nKeys > 1 case
+ */
+ #define TEMPLATE_QSORT_ARG_HEAP(TYPE, COMPAR) \
+ DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, inlheap, \
+ SING_ADDITIONAL_CODE, TYPE##inlheapcomparetup_inline) \
+ DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, regheap, \
+ MULT_ADDITIONAL_CODE(TYPE##regheapAppFunc), \
+ TYPE##regheapcomparetup_inline)
diff --git a/src/port/qsort.c b/src/port/qsort.c
new file mode 100644
index 8e2c6d9..d1981d6
*** a/src/port/qsort.c
--- b/src/port/qsort.c
***************
*** 7,13 ****
* Remove ill-considered "swap_cnt" switch to insertion sort,
* in favor of a simple check for presorted input.
*
! * CAUTION: if you change this file, see also qsort_arg.c
*
* src/port/qsort.c
*/
--- 7,14 ----
* Remove ill-considered "swap_cnt" switch to insertion sort,
* in favor of a simple check for presorted input.
*
! * CAUTION: if you change this file, see also qsort_arg.c and
! * template_qsort_arg.h
*
* src/port/qsort.c
*/
diff --git a/src/port/qsort_arg.c b/src/port/qsort_arg.c
new file mode 100644
index 28d1894..0ab6198
*** a/src/port/qsort_arg.c
--- b/src/port/qsort_arg.c
***************
*** 7,13 ****
* Remove ill-considered "swap_cnt" switch to insertion sort,
* in favor of a simple check for presorted input.
*
! * CAUTION: if you change this file, see also qsort.c
*
* src/port/qsort_arg.c
*/
--- 7,13 ----
* Remove ill-considered "swap_cnt" switch to insertion sort,
* in favor of a simple check for presorted input.
*
! * CAUTION: if you change this file, see also qsort.c and template_qsort_arg.h
*
* src/port/qsort_arg.c
*/
I decided to take advantage of my ongoing access to a benchmarking
server to see how I could get on with a query with an especially large
sort. Now, that box has 16GB of RAM, and some people have that much
memory in their laptops these days, so I was somewhat limited in how
far I could push things.
I eventually decided upon much the same benchmark as I'd made in my
previous e-mail (the query "SELECT * FROM pgbench_accounts ORDER BY
bid;", but primed with random data, while being sure to subsequently
vacuum).
I stress that I have not made any attempt to artificially remove
client overhead. I have, however, used a faster disk (caches were not
warmed in a separate run of pgbench or anything like that for either
of my two most recent benchmarks, so there may have been and indeed
may still be some small bias there), in addition to connecting pgbench
with the -h option. Apparently a known problem with Linux kernels
using the Completely Fair Scheduler introduced in 2.6.23 is that it
does not schedule the pgbench program very well when it's connecting
to the database using Unix-domain sockets, as I have been until now.
I'm not sure that this had all that much potential to spoil my
results, but I didn't want to take the chance.
Anyway, the results of running this latest benchmark, with 20
successive runs of each large query, were:
Patch:
pghost: localhost pgport: nclients: 1 nxacts: 20 dbName: pgbench
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
number of transactions per client: 20
number of transactions actually processed: 20/20
tps = 0.030924 (including connections establishing)
tps = 0.030924 (excluding connections establishing)
statement latencies in milliseconds:
31163.957400 SELECT * FROM pgbench_accounts ORDER BY bid;
Master:
pghost: localhost pgport: nclients: 1 nxacts: 20 dbName: pgbench
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
number of transactions per client: 20
number of transactions actually processed: 20/20
tps = 0.026769 (including connections establishing)
tps = 0.026769 (excluding connections establishing)
statement latencies in milliseconds:
36179.508750 SELECT * FROM pgbench_accounts ORDER BY bid;
That's a larger proportional improvement than reported on Thursday,
and at significantly greater scale.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Jan 19, 2012 at 1:47 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
Thoughts?
I generated some random data using this query:
create table nodups (g) as select (g%10)*1000+g/10 from
generate_series(1,10000) g;
And then used pgbench to repeatedly sort it using this query:
select * from nodups order by g offset 10001;
The patch came out about 28% faster than master. Admittedly, that's
with no client overhead, but still: not bad.
I don't like the API you've designed, though: as you have it,
PrepareSortSupportFromOrderingOp does this after calling the sort
support function:
+ ssup->usable_compar = ResolveComparatorProper(sortFunction);
I think it would be better to simply have the sort support functions
set usable_compar themselves. That would allow any user-defined
functions that happen to have the same binary representation and
comparison rules as one of the types for which we supply a custom
qsort() to still make use of the optimization.
There's no real reason to have a separate switch to decide how to
initialize that field: the sort support trampoline already does that,
and I don't see any reason to introduce a second way of doing the same
thing.
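Concretely, that just means the sort support function claims the fast
qsort case itself. A hypothetical user-defined type whose binary
representation and comparison rules happen to match int4 could then do
something like the following (mytype_sortsupport and mytype_fastcmp are
invented names; the field assignments simply mirror what
btint4sortsupport would do under this scheme):

Datum
mytype_sortsupport(PG_FUNCTION_ARGS)
{
    SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);

    /* the type's own low-overhead comparator */
    ssup->comparator = mytype_fastcmp;
    /* representation-compatible with int4, so claim that specialization */
    ssup->usable_compar = TYPE_COMP_INT4;
    PG_RETURN_VOID();
}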
I am also a little unhappy that we have to duplicate the fastcmp
functions from nbtcompare.c in builtins.h. Wouldn't it make more
sense to declare those functions as inline in nbtcompare.c, and then
call the qsort-generating macro from that file?
There were a couple of comment updates in tuplesort.c that looked
independent of the rest of the patch, so I've committed those
separately. I also committed your change to downgrade the
belt-and-suspenders check for self-comparison to an assert, with some
rewording of your proposed comment.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 26 January 2012 19:45, Robert Haas <robertmhaas@gmail.com> wrote:
The patch came out about 28% faster than master. Admittedly, that's
with no client overhead, but still: not bad.
Thanks. There was a 28% reduction in the time it took to execute the
query, but there would have also been a larger reduction in the time
that the backend held all of that locally-allocated memory. That
might also be worth instrumenting directly, by turning on "trace_sort"
- can you report numbers on that, please?
I don't like the API you've designed, though: as you have it,
PrepareSortSupportFromOrderingOp does this after calling the sort
support function:
+ ssup->usable_compar = ResolveComparatorProper(sortFunction);
I think it would be better to simply have the sort support functions
set usable_compar themselves. That would allow any user-defined
functions that happen to have the same binary representation and
comparison rules as one of the types for which we supply a custom
qsort() to still make use of the optimization.
There's no real reason to have a separate switch to decide how to
initialize that field: the sort support trampoline already does that,
and I don't see any reason to introduce a second way of doing the same
thing.
Hmm. You're right. I can't believe that that didn't occur to me. In
practice, types that use the SortSupport API are all going to be
façades on scalar types anyway, much like date and timestamp, and of
those a good proportion will surely have the same comparator
representation as the specialisations introduced by this patch. It
might be that virtually all third-party types that end up using the
API can avail of some specialisation.
I am also a little unhappy that we have to duplicate the fastcmp
functions from nbtcompare.c in builtins.h. Wouldn't it make more
sense to declare those functions as inline in nbtcompare.c, and then
call the qsort-generating macro from that file?
Maybe it would, but since the meta-qsort_arg introduced includes
partial duplicates of code from tuplesort.c, it kind of felt right to
"instantiate" specialisations there. It may be that doing it in
nbtcompare.c is the best option available to us. Off the top of my
head, I'm pretty sure that that's a good bit less code.
There were a couple of comment updates in tuplesort.c that looked
independent of the rest of the patch, so I've committed those
separately. I also committed your change to downgrade the
belt-and-suspenders check for self-comparison to an assert, with some
rewording of your proposed comment.
That seems reasonable.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Jan 26, 2012 at 4:09 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
On 26 January 2012 19:45, Robert Haas <robertmhaas@gmail.com> wrote:
The patch came out about 28% faster than master. Admittedly, that's
with no client overhead, but still: not bad.
Thanks. There was a 28% reduction in the time it took to execute the
query, but there would have also been a larger reduction in the time
that the backend held all of that locally-allocated memory. That
might also be worth instrumenting directly, by turning on "trace_sort"
- can you report numbers on that, please?
Apparently not. The sort is too short to register in the trace_sort
output. I just get:
LOG: begin tuple sort: nkeys = 1, workMem = 1024, randomAccess = f
LOG: performsort starting: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG: performsort done: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG: internal sort ended, 853 KB used: CPU 0.00s/0.00u sec elapsed 0.00 sec
I don't like the API you've designed, though: as you have it,
PrepareSortSupportFromOrderingOp does this after calling the sort
support function:
+ ssup->usable_compar = ResolveComparatorProper(sortFunction);
I think it would be better to simply have the sort support functions
set usable_compar themselves. That would allow any user-defined
functions that happen to have the same binary representation and
comparison rules as one of the types for which we supply a custom
qsort() to still make use of the optimization.
There's no real reason to have a separate switch to decide how to
initialize that field: the sort support trampoline already does that,
and I don't see any reason to introduce a second way of doing the same
thing.
Hmm. You're right. I can't believe that that didn't occur to me. In
practice, types that use the SortSupport API are all going to be
façades on scalar types anyway, much like date and timestamp, and of
those a good proportion will surely have the same comparator
representation as the specialisations introduced by this patch. It
might be that virtually all third-party types that end up using the
API can avail of some specialisation.
Possibly. At a minimum it keeps the door open.
I am also a little unhappy that we have to duplicate the fastcmp
functions from nbtcompare.c in builtins.h. Wouldn't it make more
sense to declare those functions as inline in nbtcompare.c, and then
call the qsort-generating macro from that file?
Maybe it would, but since the meta-qsort_arg introduced includes
partial duplicates of code from tuplesort.c, it kind of felt right to
"instantiate" specialisations there. It may be that doing it in
nbtcompare.c is the best option available to us. Off the top of my
head, I'm pretty sure that that's a good bit less code.
I was hoping so...
There were a couple of comment updates in tuplesort.c that looked
independent from the reset of the patch, so I've committed those
separately. I also committed your change to downgrade the
belt-and-suspenders check for self-comparison to an assert, with some
rewording of your proposed comment.
That seems reasonable.
Cool.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Alright, so while I agree with everything you've asked for, the fact
is that there is a controversy in relation to binary bloat, and that's
the blocker here. How can we satisfactorily resolve that, or is that
question adequately addressed by the benchmark that I produced?
What if third party modules could include the template_qsort_arg.h
header, and instantiate their own specialisation? The sort support
function could either set an enum, or set a pointer to a qsort_arg
specialisation, if their comparator was a little different, but still
just a few instructions, as with float-based timestamps (I don't care
enough about them to provide one in core, though). It would also
essentially allow for user-defined sort functions, provided they
fulfilled a basic interface. They may not even have to be
comparison-based. I know that I expressed scepticism about the weird
and wonderful ideas that some people have put forward in that area,
but that's mostly because I don't think that GPU based sorting in a
database is going to be practical.
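To sketch what I mean, a module might do roughly the following (every
name here is invented for illustration, and this glosses over the fact
that template_qsort_arg.h currently leans on tuplesort.c-private
definitions, which is part of what would have to be relaxed):

#include "utils/sortsupport.h"
#include "utils/template_qsort_arg.h"

/* a cheap, few-instruction comparator for the hypothetical type */
static inline int
mytype_fastcmp(Datum x, Datum y, SortSupport ssup)
{
    int32 a = DatumGetInt32(x);
    int32 b = DatumGetInt32(y);

    if (a > b)
        return 1;
    else if (a == b)
        return 0;
    else
        return -1;
}

#define mytype_shim(x, y) mytype_fastcmp((x), (y), NULL)

/* instantiates mytype_inlheap_qsort_arg and mytype_regheap_qsort_arg */
TEMPLATE_QSORT_ARG_HEAP(mytype, mytype_shim);

Datum
mytype_sortsupport(PG_FUNCTION_ARGS)
{
    SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);

    ssup->comparator = mytype_fastcmp;
    /* advertise the module's own specialisation; whether tuplesort keys
     * off an enum value or just a non-NULL pointer is an open detail */
    ssup->qsort_arg_spec = mytype_inlheap_qsort_arg;
    PG_RETURN_VOID();
}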
Why not add this capability to the SortSupport API, given that it
wouldn't be very hard? The user would set the enum too, so it would
appear in explain analyze output as "custom sort" or something like
that.
While a sorting specialisation of our qsort_arg is actually pretty
close to optimal for scalar types and façades thereof, there could be
a type that would benefit from using radix sort, for example, which
isn't even comparison-based, and a user couldn't very well do that
with the existing SortSupport API.
I certainly don't care about this capability enough to defend it
against any objections that anyone may have, especially at this late
stage in the cycle. I just think that we might as well have it.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Jan 26, 2012 at 5:10 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
Alright, so while I agree with everything you've asked for, the fact
is that there is a controversy in relation to binary bloat, and that's
the blocker here. How can we satisfactorily resolve that, or is that
question adequately addressed by the benchmark that I produced?
I could go either way on that, honestly. I took a look today and
found out that on my machine, a postgres 9.0.x executable was about
282kB larger than an 8.4 postgres executable. 9.1 was about 386kB
larger than that, and 9.2 current is 239kB larger still. So it's not
as if your patch is the first one to enlarge the size of the binary -
it's obviously been growing steadily from release to release for
years. I found that your patch adds 55kB to the executable size,
which is clearly a lot more than what a typical patch adds, but still
under 1%. So I wouldn't be shocked and appalled if we decided to
commit this as-is.
But if we want to put it on a diet, the first thing I'd probably be
inclined to lose is the float4 specialization. Some members of the
audience will recall that I take a dim view of floating point arithmetic
generally, but I'll concede that there are valid reasons for using
float8. I have a harder time coming up with a good reason to use
float4 - ever, for anything you care about. So I would be inclined to
think that if we want to trim this back a bit, maybe that's the one to
let go. If we want to be even more aggressive, the next thing I'd
probably lose is the optimization of multiple sortkey cases, on the
theory that single sort keys are probably by far the most common
practical case.
I'm not surprised that you weren't able to measure a performance
regression from the binary bloat. Any such regression is bound to be
very small and probably quite difficult to notice most of the time;
it's really the cumulative effect of many binary-size-increasing
changes we're worried about, not each individual one. Certainly,
trying to shrink the binary is micro-optimization at its finest, but
then so is inlining comparators. I don't think it can be
realistically argued that increasing the size of the binary
arbitrarily will never get us in trouble, much like (for a typical
American family) spending $30 to have dinner at a cheap restaurant
will never break the budget. But if you do it every day, it gets
expensive (and fattening).
What if third party modules could include the template_qsort_arg.h
header, and instantiate their own specialisation?
Seems reasonable to me.
The sort support
function could either set an enum, or set a pointer to a qsort_arg
specialisation, if their comparator was a little different, but still
The latter seems better.
just a few instructions, as with float-based timestamps (I don't care
enough about them to provide one in core, though). It would also
essentially allow for user-defined sort functions, provided they
fulfilled a basic interface. They may not even have to be
comparison-based. I know that I expressed scepticism about the weird
and wonderful ideas that some people have put forward in that area,
but that's mostly because I don't think that GPU based sorting in a
database is going to be practical.
A question for another day.
Why not add this capability to the SortSupport API, given that it
wouldn't be very hard? The user would set the enum too, so it would
appear in explain analyze output as "custom sort" or something like
that.
I'm unenthused about going to any trouble to change the EXPLAIN format.
While a sorting specialisation of our qsort_arg is actually pretty
close to optimal for scalar types and façades thereof, there could be
a type that would benefit from using radix sort, for example, which
isn't even comparison-based, and a user couldn't very well do that
with the existing SortSupport API.
I certainly don't care about this capability enough to defend it
against any objections that anyone may have, especially at this late
stage in the cycle. I just think that we might as well have it.
I don't see any reason not to, assuming it's not a lot of code.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 27 January 2012 03:32, Robert Haas <robertmhaas@gmail.com> wrote:
But if we want to put it on a diet, the first thing I'd probably be
inclined to lose is the float4 specialization. Some members of the
audience will recall that I take a dim view of floating point arithmetic
generally, but I'll concede that there are valid reasons for using
float8. I have a harder time coming up with a good reason to use
float4 - ever, for anything you care about. So I would be inclined to
think that if we want to trim this back a bit, maybe that's the one to
let go. If we want to be even more aggressive, the next thing I'd
probably lose is the optimization of multiple sortkey cases, on the
theory that single sort keys are probably by far the most common
practical case.
Obviously I don't think that we should let anything go, as the
improvement in performance is so large that we're bound to be ahead -
we only really pay for what we use anyway, and when we use a
specialisation the difference is really quite big, particularly when
you look at sorting in isolation. If a specialisation is never used,
it is more or less never paid for, so there's no point in worrying
about that. That said, float4 is obviously the weakest link. I'm
inclined to think that float8 is the second weakest though, mostly
because we get both dates and timestamps "for free" with the integer
specialisations.
I'm not surprised that you weren't able to measure a performance
regression from the binary bloat. Any such regression is bound to be
very small and probably quite difficult to notice most of the time;
it's really the cumulative effect of many binary-size-increasing
changes we're worried about, not each individual one. Certainly,
trying to shrink the binary is micro-optimization at its finest, but
then so is inlining comparators. I don't think it can be
realistically argued that increasing the size of the binary
arbitrarily will never get us in trouble, much like (for a typical
American family) spending $30 to have dinner at a cheap restaurant
will never break the budget. But if you do it every day, it gets
expensive (and fattening).
Sure. At the risk of stating the obvious, and of repeating myself, I
will point out that the true cost of increasing the size of the binary
is not necessarily linear - it's a complex equation. I hope that this
doesn't sound flippant, but if some naive person were to look at just
the increasing binary size of Postgres and its performance in each
successive release, they might conclude that there was a positive
correlation between the two (since we didn't add flab to the binary,
but muscle that pulls its own weight and then some).
At the continued risk of stating the obvious, CPUs don't just cache
instructions - they cache data too. If we spend less than half the
time sorting data, which is the level of improvement I was able to
demonstrate against pre-SortSupport Postgres, that will surely very
often have the aggregate effect of ameliorating cache contention
between cores.
just a few instructions, as with float-based timestamps (I don't care
enough about them to provide one in core, though). It would also
essentially allow for user-defined sort functions, provided they
fulfilled a basic interface. They may not even have to be
comparison-based. I know that I expressed scepticism about the weird
and wonderful ideas that some people have put forward in that area,
but that's mostly because I don't think that GPU based sorting in a
database is going to be practical.
A question for another day.
Fair enough.
I certainly don't care about this capability enough to defend it
against any objections that anyone may have, especially at this late
stage in the cycle. I just think that we might as well have it.
I don't see any reason not to, assuming it's not a lot of code.
Good.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Jan 26, 2012 at 11:36 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
I'm not surprised that you weren't able to measure a performance
regression from the binary bloat. Any such regression is bound to be
very small and probably quite difficult to notice most of the time;
it's really the cumulative effect of many binary-size-increasing
changes we're worried about, not each individual one. Certainly,
trying to shrink the binary is micro-optimization at its finest, but
then so is inlining comparators. I don't think it can be
realistically argued that increasing the size of the binary
arbitrarily will never get us in trouble, much like (for a typical
American family) spending $30 to have dinner at a cheap restaurant
will never break the budget. But if you do it every day, it gets
expensive (and fattening).
Sure. At the risk of stating the obvious, and of repeating myself, I
will point out that the true cost of increasing the size of the binary
is not necessarily linear - it's a complex equation. I hope that this
doesn't sound flippant, but if some naive person were to look at just
the increasing binary size of Postgres and its performance in each
successive release, they might conclude that there was a positive
correlation between the two (since we didn't add flab to the binary,
but muscle that pulls its own weight and then some).
I completely agree. So the point is that, when faced with a patch that
adds an atypically large number of CPU instructions, we ought to ask
ourselves whether those instructions are pulling their weight. By way
of comparison, the latest posted version of Tom's generalized
index-only paths patch adds 14kB to the resulting executable (on my
Mac). Yours adds 55kB. We might ask ourselves whether the benefits
of this patch are four times greater than the benefits of Tom's patch.
It's pretty hard to compare them directly since they're doing very
different things, and all of this is completely subjective, but I
doubt it. On the other hand, there is no rule that every byte of code
that gets committed must be made of solid gold, either. So, once
again: judgement call.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Uh, obviously I meant causal relationship and not correlation.
On 27 January 2012 13:37, Robert Haas <robertmhaas@gmail.com> wrote:
I completely agree. So the point is that, when faced with a patch that
adds an atypically large number of CPU instructions, we ought to ask
ourselves whether those instructions are pulling their weight. By way
of comparison, the latest posted version of Tom's generalized
index-only paths patch adds 14kB to the resulting executable (on my
Mac). Yours adds 55kB. We might ask ourselves whether the benefits
of this patch are four times greater than the benefits of Tom's patch.
It's pretty hard to compare them directly since they're doing very
different things, and all of this is completely subjective, but I
doubt it. On the other hand, there is no rule that every byte of code
that gets committed must be made of solid gold, either. So, once
again: judgement call.
Well, I don't think it's all that subjective - it's more the case that
it is just difficult, or it gets that way as you consider more
specialisations.
As for what types/specialisations may not make the cut, I'm
increasingly convinced that floats (in the following order: float4,
float8) should be the first to go. Aside from the fact that we cannot
use their specialisations for anything like dates and timestamps,
floats are just way less useful than integers in the context of
database applications, or at least those that I've been involved with.
As important as floats are in the broad context of computing, it's
usually only acceptable to store data in a database as floats within
scientific applications, and only then when their limitations are
well-understood and acceptable. I think we've all heard anecdotes at
one time or another, involving their limitations not being well
understood.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Fri, Jan 27, 2012 at 9:27 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
Well, I don't think it's all that subjective - it's more the case that
it is just difficult, or it gets that way as you consider more
specialisations.
Sure it's subjective. Two well-meaning people could have different
opinions without either of them being "wrong". If you do a lot of
small, in-memory sorts, more of this stuff is going to seem worthwhile
than if you don't.
As for what types/specialisations may not make the cut, I'm
increasingly convinced that floats (in the following order: float4,
float8) should be the first to go. Aside from the fact that we cannot
use their specialisations for anything like dates and timestamps,
floats are just way less useful than integers in the context of
database applications, or at least those that I've been involved with.
As important as floats are in the broad context of computing, it's
usually only acceptable to store data in a database as floats within
scientific applications, and only then when their limitations are
well-understood and acceptable. I think we've all heard anecdotes at
one time or another, involving their limitations not being well
understood.
While we're waiting for anyone else to weigh in with an opinion on the
right place to draw the line here, do you want to post an updated
patch with the changes previously discussed?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 27 January 2012 14:37, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jan 27, 2012 at 9:27 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
Well, I don't think it's all that subjective - it's more the case that
it is just difficult, or it gets that way as you consider more
specialisations.
Sure it's subjective. Two well-meaning people could have different
opinions without either of them being "wrong". If you do a lot of
small, in-memory sorts, more of this stuff is going to seem worthwhile
than if you don't.
But if you don't, then you're not going to have your cache
compromised, so the cost is limited to having to store a few tens of
kilobytes of extra binary executable data on disk, and perhaps
main-memory, that you wouldn't otherwise have to - a cost that is
virtually indistinguishable from zero. When you do eventually need to
do some in-memory sort, you get to have that go significantly faster,
and since you don't have much use for the specialisations anyway, you
get that with essentially no down-side.
The concern is a perfect storm of all specialisations being
simultaneously used such that it'd be more efficient to use a generic
qsort. I think that's pretty implausible - the general assumption is
that database applications are not frequently CPU bound. They're
assumed to be frequently memory-bound though, so any effort to reduce
memory consumption - which this patch effectively does - is probably
going to be more valuable.
Even if we suppose that the perfect storm can and does occur, on a
chip that is so starved of instruction cache that it turns out to be a
net loss, surely even then the perfect storm is a rare occurrence, and
the aggregate effect is that they benefit. Besides, Postgres
performance optimisations for which you can contrive a case that
results in a net-loss in performance are well precedented.
As for what types/specialisations may not make the cut, I'm
increasingly convinced that floats (in the following order: float4,
float8) should be the first to go. Aside from the fact that we cannot
use their specialisations for anything like dates and timestamps,
floats are just way less useful than integers in the context of
database applications, or at least those that I've been involved with.
As important as floats are in the broad context of computing, it's
usually only acceptable to store data in a database as floats within
scientific applications, and only then when their limitations are
well-understood and acceptable. I think we've all heard anecdotes at
one time or another, involving their limitations not being well
understood.
While we're waiting for anyone else to weigh in with an opinion on the
right place to draw the line here, do you want to post an updated
patch with the changes previously discussed?
Patch is attached. I have not changed the duplicate functions. This is
because I concluded that it was the lesser of two evils to have to get
the template to generate both declarations in the header file, and
definitions in the .c file - that seemed particularly obscure. We're
never going to have to expose/duplicate any more comparators anyway.
Do you agree?
It's pretty easy to remove a specialisation at any time - just remove
less than 10 lines of code. It's also pretty difficult to determine,
with everyone's absolute confidence, where the right balance lies.
Perhaps the sensible thing to do is to not be so conservative in what
we initially commit, while clearly acknowledging that we may not have
the balance right, and that it may have to change. We then have the
entire beta part of the cycle in which to decide to roll back from
that position, without any plausible downside. If, on the other hand,
we conservatively lean towards fewer specialisations in the initial
commit, no one will complain about the lack of an improvement in
performance that they never had.
Tom's objections related to the total number of specialisations, and
their distributed costs - the very idea of full specialisations was
not objected to. I think it's fair to say that there is no controversy
at all remaining about whether or not we should have *some* number of
specialisations. Therefore, I'm going to suggest that assuming you
have no further objections to the style of the code, and no one else
voices any other objections in the next couple of days, that you
provisionally commit this latest revision with all of its
specialisations, while putting people on notice about this.
I think that possibly the one remaining blocker to tentatively
committing this with all specialisations intact is that I haven't
tested this on Windows, as I don't currently have access to a Windows
development environment. I have set one up before, but it's a huge
pain. Can anyone help me out?
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Attachments:
fastpath_sort_2012_01_27.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/access/nbtree/nbtcompare.c b/src/backend/access/nbtree/nbtcompare.c
new file mode 100644
index fedde93..096b8bc
*** a/src/backend/access/nbtree/nbtcompare.c
--- b/src/backend/access/nbtree/nbtcompare.c
***************
*** 49,55 ****
#include "postgres.h"
#include "utils/builtins.h"
- #include "utils/sortsupport.h"
Datum
--- 49,54 ----
*************** btint4cmp(PG_FUNCTION_ARGS)
*** 102,108 ****
PG_RETURN_INT32(-1);
}
! static int
btint4fastcmp(Datum x, Datum y, SortSupport ssup)
{
int32 a = DatumGetInt32(x);
--- 101,109 ----
PG_RETURN_INT32(-1);
}
! #ifndef USE_INLINE
!
! int
btint4fastcmp(Datum x, Datum y, SortSupport ssup)
{
int32 a = DatumGetInt32(x);
*************** btint4fastcmp(Datum x, Datum y, SortSupp
*** 116,127 ****
--- 117,131 ----
return -1;
}
+ #endif /* ! USE_INLINE */
+
Datum
btint4sortsupport(PG_FUNCTION_ARGS)
{
SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
ssup->comparator = btint4fastcmp;
+ ssup->usable_compar = TYPE_COMP_INT4;
PG_RETURN_VOID();
}
*************** btint8cmp(PG_FUNCTION_ARGS)
*** 139,145 ****
PG_RETURN_INT32(-1);
}
! static int
btint8fastcmp(Datum x, Datum y, SortSupport ssup)
{
int64 a = DatumGetInt64(x);
--- 143,151 ----
PG_RETURN_INT32(-1);
}
! #ifndef USE_INLINE
!
! int
btint8fastcmp(Datum x, Datum y, SortSupport ssup)
{
int64 a = DatumGetInt64(x);
*************** btint8fastcmp(Datum x, Datum y, SortSupp
*** 153,164 ****
--- 159,173 ----
return -1;
}
+ #endif /* ! USE_INLINE */
+
Datum
btint8sortsupport(PG_FUNCTION_ARGS)
{
SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
ssup->comparator = btint8fastcmp;
+ ssup->usable_compar = TYPE_COMP_INT8;
PG_RETURN_VOID();
}
diff --git a/src/backend/utils/adt/date.c b/src/backend/utils/adt/date.c
new file mode 100644
index bf3b6ea..0ae54d7
*** a/src/backend/utils/adt/date.c
--- b/src/backend/utils/adt/date.c
*************** date_sortsupport(PG_FUNCTION_ARGS)
*** 340,345 ****
--- 340,346 ----
SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
ssup->comparator = date_fastcmp;
+ ssup->usable_compar = TYPE_COMP_INT4;
PG_RETURN_VOID();
}
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
new file mode 100644
index 08434dc..8dd137e
*** a/src/backend/utils/adt/float.c
--- b/src/backend/utils/adt/float.c
***************
*** 23,29 ****
#include "libpq/pqformat.h"
#include "utils/array.h"
#include "utils/builtins.h"
- #include "utils/sortsupport.h"
#ifndef M_PI
--- 23,28 ----
*************** do { \
*** 67,76 ****
/* Configurable GUC parameter */
int extra_float_digits = 0; /* Added to DBL_DIG or FLT_DIG */
-
- static int float4_cmp_internal(float4 a, float4 b);
- static int float8_cmp_internal(float8 a, float8 b);
-
#ifndef HAVE_CBRT
/*
* Some machines (in particular, some versions of AIX) have an extern
--- 66,71 ----
*************** float8div(PG_FUNCTION_ARGS)
*** 844,850 ****
/*
* float4{eq,ne,lt,le,gt,ge} - float4/float4 comparison operations
*/
! static int
float4_cmp_internal(float4 a, float4 b)
{
/*
--- 839,847 ----
/*
* float4{eq,ne,lt,le,gt,ge} - float4/float4 comparison operations
*/
! #ifndef USE_INLINE
!
! int
float4_cmp_internal(float4 a, float4 b)
{
/*
*************** float4_cmp_internal(float4 a, float4 b)
*** 874,879 ****
--- 871,878 ----
}
}
+ #endif /* ! USE_INLINE */
+
Datum
float4eq(PG_FUNCTION_ARGS)
{
*************** btfloat4cmp(PG_FUNCTION_ARGS)
*** 937,943 ****
PG_RETURN_INT32(float4_cmp_internal(arg1, arg2));
}
! static int
btfloat4fastcmp(Datum x, Datum y, SortSupport ssup)
{
float4 arg1 = DatumGetFloat4(x);
--- 936,945 ----
PG_RETURN_INT32(float4_cmp_internal(arg1, arg2));
}
!
! #ifndef USE_INLINE
!
! int
btfloat4fastcmp(Datum x, Datum y, SortSupport ssup)
{
float4 arg1 = DatumGetFloat4(x);
*************** btfloat4fastcmp(Datum x, Datum y, SortSu
*** 946,964 ****
return float4_cmp_internal(arg1, arg2);
}
Datum
btfloat4sortsupport(PG_FUNCTION_ARGS)
{
SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
ssup->comparator = btfloat4fastcmp;
PG_RETURN_VOID();
}
/*
* float8{eq,ne,lt,le,gt,ge} - float8/float8 comparison operations
*/
! static int
float8_cmp_internal(float8 a, float8 b)
{
/*
--- 948,972 ----
return float4_cmp_internal(arg1, arg2);
}
+ #endif /* ! USE_INLINE */
+
Datum
btfloat4sortsupport(PG_FUNCTION_ARGS)
{
SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
ssup->comparator = btfloat4fastcmp;
+ ssup->usable_compar = TYPE_COMP_FLOAT4;
PG_RETURN_VOID();
}
/*
* float8{eq,ne,lt,le,gt,ge} - float8/float8 comparison operations
*/
!
! #ifndef USE_INLINE
!
! int
float8_cmp_internal(float8 a, float8 b)
{
/*
*************** float8_cmp_internal(float8 a, float8 b)
*** 988,993 ****
--- 996,1003 ----
}
}
+ #endif /* ! USE_INLINE */
+
Datum
float8eq(PG_FUNCTION_ARGS)
{
*************** btfloat8cmp(PG_FUNCTION_ARGS)
*** 1051,1057 ****
PG_RETURN_INT32(float8_cmp_internal(arg1, arg2));
}
! static int
btfloat8fastcmp(Datum x, Datum y, SortSupport ssup)
{
float8 arg1 = DatumGetFloat8(x);
--- 1061,1069 ----
PG_RETURN_INT32(float8_cmp_internal(arg1, arg2));
}
! #ifndef USE_INLINE
!
! int
btfloat8fastcmp(Datum x, Datum y, SortSupport ssup)
{
float8 arg1 = DatumGetFloat8(x);
*************** btfloat8fastcmp(Datum x, Datum y, SortSu
*** 1060,1071 ****
--- 1072,1086 ----
return float8_cmp_internal(arg1, arg2);
}
+ #endif /* ! USE_INLINE */
+
Datum
btfloat8sortsupport(PG_FUNCTION_ARGS)
{
SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
ssup->comparator = btfloat8fastcmp;
+ ssup->usable_compar = TYPE_COMP_FLOAT8;
PG_RETURN_VOID();
}
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
new file mode 100644
index db8d2c2..b7e090b
*** a/src/backend/utils/adt/timestamp.c
--- b/src/backend/utils/adt/timestamp.c
*************** timestamp_sortsupport(PG_FUNCTION_ARGS)
*** 1846,1851 ****
--- 1846,1854 ----
SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
ssup->comparator = timestamp_fastcmp;
+ #ifdef HAVE_INT64_TIMESTAMP
+ ssup->usable_compar = TYPE_COMP_INT8;
+ #endif
PG_RETURN_VOID();
}
diff --git a/src/backend/utils/sort/sortsupport.c b/src/backend/utils/sort/sortsupport.c
new file mode 100644
index 7f388fd..0d77b3d
*** a/src/backend/utils/sort/sortsupport.c
--- b/src/backend/utils/sort/sortsupport.c
*************** PrepareSortSupportFromOrderingOp(Oid ord
*** 139,144 ****
--- 139,147 ----
Oid sortFunction;
bool issupport;
+ /* Don't rely on sortsupport function to set comparator representation */
+ ssup->usable_compar = TYPE_COMP_OTHER;
+
if (!get_sort_function_for_ordering_op(orderingOp,
&sortFunction,
&issupport,
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 1452e8c..718d2ee
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
***************
*** 106,111 ****
--- 106,112 ----
#include "executor/executor.h"
#include "miscadmin.h"
#include "pg_trace.h"
+ #include "utils/builtins.h"
#include "utils/datum.h"
#include "utils/logtape.h"
#include "utils/lsyscache.h"
***************
*** 113,118 ****
--- 114,120 ----
#include "utils/pg_rusage.h"
#include "utils/rel.h"
#include "utils/sortsupport.h"
+ #include "utils/template_qsort_arg.h"
#include "utils/tuplesort.h"
*************** static void tuplesort_heap_insert(Tuples
*** 456,461 ****
--- 458,464 ----
static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex);
static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
static void markrunend(Tuplesortstate *state, int tapenum);
+ static inline int comparetup_inline(void *a, void *b, Tuplesortstate *state);
static int comparetup_heap(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
*************** static void readtup_datum(Tuplesortstate
*** 492,497 ****
--- 495,536 ----
static void reversedirection_datum(Tuplesortstate *state);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ /*
+ * Manufacture type specific sorting specialisations with inline comparators.
+ *
+ * Use fast comparator functions for "comparator proper" in each case. They're
+ * generally used via function pointers to elide SQL-function-call indirection
+ * when resolving the comparator of a btree index opclass for sorting
+ * (historically, tuplesort went through pg_proc in all cases) but that doesn't
+ * provide the additional benefits of inlining and other optimizations from
+ * compile-time comparator knowledge, through the generation of complete
+ * per-type sorting specializations.
+ */
+ #define int4_shim(x,y) btint4fastcmp(x, y, NULL)
+ #define int8_shim(x,y) btint8fastcmp(x, y, NULL)
+ #define float4_shim(x,y) btfloat4fastcmp(x, y, NULL)
+ #define float8_shim(x,y) btfloat8fastcmp(x, y, NULL)
+
+ #define noop_cmp_proper(nop1, nop2) 0
+ #define noop_code (void)0;
+
+ /* Instantiate sorting specializations (for heap tuples only) */
+ TEMPLATE_QSORT_ARG_HEAP(int4, int4_shim);
+ TEMPLATE_QSORT_ARG_HEAP(int8, int8_shim);
+ TEMPLATE_QSORT_ARG_HEAP(float4, float4_shim);
+ TEMPLATE_QSORT_ARG_HEAP(float8, float8_shim);
+
+ /*
+ * A type-neutral specialization, for single sort-key sorts not covered by other
+ * specializations. This specialization could be used for sorting both heap and
+ * index tuples that meet that criteria. However, it is currently only used for
+ * heap tuples.
+ *
+ * This could result in dead code, but since the redundant functions generated
+ * are all static, they will almost invariably be discarded by compiler
+ * optimizations performed by all modern compilers.
+ */
+ DO_TEMPLATE_QSORT_ARG(all, noop_cmp_proper, sing, noop_code, comparetup_inline);
/*
* tuplesort_begin_xxx
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 631,636 ****
--- 670,716 ----
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
+ Assert(state->sortKeys->usable_compar == TYPE_COMP_FULL_SPECIALISATION ||
+ state->sortKeys->qsort_arg_spec == NULL);
+
+ if (state->sortKeys)
+ {
+ switch(state->sortKeys->usable_compar)
+ {
+ case TYPE_COMP_OTHER:
+ state->sortKeys->qsort_arg_spec =
+ nkeys==1?all_sing_qsort_arg:NULL;
+ break;
+ case TYPE_COMP_INT4:
+ state->sortKeys->qsort_arg_spec =
+ nkeys==1?int4_inlheap_qsort_arg:int4_regheap_qsort_arg;
+ break;
+ case TYPE_COMP_INT8:
+ state->sortKeys->qsort_arg_spec =
+ nkeys==1?int8_inlheap_qsort_arg:int8_regheap_qsort_arg;
+ break;
+ case TYPE_COMP_FLOAT4:
+ state->sortKeys->qsort_arg_spec =
+ nkeys==1?float4_inlheap_qsort_arg:float4_regheap_qsort_arg;
+ break;
+ case TYPE_COMP_FLOAT8:
+ state->sortKeys->qsort_arg_spec =
+ nkeys==1?float8_inlheap_qsort_arg:float8_regheap_qsort_arg;
+ break;
+ case TYPE_COMP_FULL_SPECIALISATION:
+ Assert(state->sortKeys->qsort_arg_spec != NULL);
+ break;
+ default:
+ elog(ERROR, "unrecognized comparator type for heap tuplesort");
+ }
+ #ifdef TRACE_SORT
+ if (trace_sort)
+ elog(LOG,
+ "begin tuple sort: specialization used: %d",
+ state->sortKeys->usable_compar);
+ #endif
+
+ }
MemoryContextSwitchTo(oldcontext);
return state;
*************** tuplesort_performsort(Tuplesortstate *st
*** 1221,1232 ****
* We were able to accumulate all the tuples within the allowed
* amount of memory. Just qsort 'em and we're done.
*/
if (state->memtupcount > 1)
! qsort_arg((void *) state->memtuples,
! state->memtupcount,
! sizeof(SortTuple),
! (qsort_arg_comparator) state->comparetup,
! (void *) state);
state->current = 0;
state->eof_reached = false;
state->markpos_offset = 0;
--- 1301,1329 ----
* We were able to accumulate all the tuples within the allowed
* amount of memory. Just qsort 'em and we're done.
*/
+
if (state->memtupcount > 1)
! {
! /* Use a sorting specialization if available */
! if (state->sortKeys && state->sortKeys->qsort_arg_spec)
! /* specialization available */
! state->sortKeys->qsort_arg_spec(
! (void *) state->memtuples,
! state->memtupcount,
! sizeof(SortTuple),
! (void *) state);
! else
! /*
! * Fall back on regular qsort_arg, with function pointer
! * comparator, making no compile-time assumptions about the
! * number of sortkeys or the datatype(s) to be sorted.
! */
! qsort_arg((void *) state->memtuples,
! state->memtupcount,
! sizeof(SortTuple),
! (qsort_arg_comparator) state->comparetup,
! (void *) state);
! }
state->current = 0;
state->eof_reached = false;
state->markpos_offset = 0;
*************** inlineApplySortFunction(FmgrInfo *sortFu
*** 2645,2653 ****
--- 2742,2779 ----
return compare;
}
+ /*
+ * This is a cut-down duplicate of comparetup_heap that exists for the express
+ * purpose of generating a sorting specialization in which it is inlined, as a
+ * performance optimization. It does not make any assumptions about the
+ * representation of the tuple proper (it could equally well be a heap or index
+ * tuple sort, though only heap tuples are currently supported).
+ *
+ * This is only possible for sorts with a single sortKey. Note that the
+ * comparator proper is not inlined, so this is only used as part of the one and
+ * only partial specialization.
+ */
+ static inline int
+ comparetup_inline(void *a, void *b, Tuplesortstate *state)
+ {
+ /* void* parameters are used to shut up the compiler */
+ const SortTuple *aT = (const SortTuple*) a, *bT = (const SortTuple*) b;
+ /* Allow interrupting long sorts */
+ CHECK_FOR_INTERRUPTS();
+
+ Assert(state->nKeys == 1);
+
+ /* Compare the one and only sort key with minimal indirection */
+ return ApplySortComparator(aT->datum1, aT->isnull1,
+ bT->datum1, bT->isnull1,
+ state->sortKeys);
+ }
/*
* Routines specialized for HeapTuple (actually MinimalTuple) case
+ *
+ * This code is partially duplicated within template_qsort_arg.h. Ensure that
+ * they are kept consistent.
*/
static int
*************** comparetup_index_btree(const SortTuple *
*** 2981,2990 ****
Tuplesortstate *state)
{
/*
! * This is similar to _bt_tuplecompare(), but we have already done the
! * index_getattr calls for the first column, and we need to keep track of
! * whether any null fields are present. Also see the special treatment
! * for equal keys at the end.
*/
ScanKey scanKey = state->indexScanKey;
IndexTuple tuple1;
--- 3107,3115 ----
Tuplesortstate *state)
{
/*
! * We have already done the index_getattr calls for the first column, and we
! * need to keep track of whether any null fields are present. Also see the
! * special treatment for equal keys at the end.
*/
ScanKey scanKey = state->indexScanKey;
IndexTuple tuple1;
diff --git a/src/include/c.h b/src/include/c.h
new file mode 100644
index 7396adb..939f49a
*** a/src/include/c.h
--- b/src/include/c.h
*************** extern int fdatasync(int fildes);
*** 850,853 ****
--- 850,865 ----
/* /port compatibility functions */
#include "port.h"
+ /*
+ * Define a cross-platform "always-inline" macro. This is a very sharp tool that
+ * should be used judiciously.
+ */
+ #ifdef __always_inline
+ #define pg_always_inline __always_inline
+ #elif defined(__force_inline)
+ #define pg_always_inline __force_inline
+ #else
+ #define pg_always_inline inline
+ #endif
+
#endif /* C_H */
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
new file mode 100644
index 68179d5..669467f
*** a/src/include/utils/builtins.h
--- b/src/include/utils/builtins.h
***************
*** 14,21 ****
--- 14,24 ----
#ifndef BUILTINS_H
#define BUILTINS_H
+ #include <math.h>
+
#include "fmgr.h"
#include "nodes/parsenodes.h"
+ #include "utils/sortsupport.h"
/*
* Defined in adt/
*************** extern Datum btfloat8sortsupport(PG_FUNC
*** 323,328 ****
--- 326,459 ----
extern Datum btoidsortsupport(PG_FUNCTION_ARGS);
extern Datum btnamesortsupport(PG_FUNCTION_ARGS);
+ /*
+ * Expose some "fast" variants of per-opclass comparison functions that
+ * will be returned by some of the above sort-support functions. Do so to
+ * avail of optimizations from compile-time knowledge of comparators,
+ * including but not limited to inlining, both for its innate value in
+ * improving locality of reference, and its value as an enabling
+ * transformation.
+ */
+
+ #ifdef USE_INLINE
+
+ static inline int
+ btint4fastcmp(Datum x, Datum y, SortSupport ssup)
+ {
+ int32 a = DatumGetInt32(x);
+ int32 b = DatumGetInt32(y);
+
+ if (a > b)
+ return 1;
+ else if (a == b)
+ return 0;
+ else
+ return -1;
+ }
+
+ static inline int
+ btint8fastcmp(Datum x, Datum y, SortSupport ssup)
+ {
+ int64 a = DatumGetInt64(x);
+ int64 b = DatumGetInt64(y);
+
+ if (a > b)
+ return 1;
+ else if (a == b)
+ return 0;
+ else
+ return -1;
+ }
+
+ static inline int
+ float4_cmp_internal(float4 a, float4 b)
+ {
+ /*
+ * We consider all NANs to be equal and larger than any non-NAN. This is
+ * somewhat arbitrary; the important thing is to have a consistent sort
+ * order.
+ */
+ if (isnan(a))
+ {
+ if (isnan(b))
+ return 0; /* NAN = NAN */
+ else
+ return 1; /* NAN > non-NAN */
+ }
+ else if (isnan(b))
+ {
+ return -1; /* non-NAN < NAN */
+ }
+ else
+ {
+ if (a > b)
+ return 1;
+ else if (a < b)
+ return -1;
+ else
+ return 0;
+ }
+ }
+
+ static inline int
+ float8_cmp_internal(float8 a, float8 b)
+ {
+ /*
+ * We consider all NANs to be equal and larger than any non-NAN. This is
+ * somewhat arbitrary; the important thing is to have a consistent sort
+ * order.
+ */
+ if (isnan(a))
+ {
+ if (isnan(b))
+ return 0; /* NAN = NAN */
+ else
+ return 1; /* NAN > non-NAN */
+ }
+ else if (isnan(b))
+ {
+ return -1; /* non-NAN < NAN */
+ }
+ else
+ {
+ if (a > b)
+ return 1;
+ else if (a < b)
+ return -1;
+ else
+ return 0;
+ }
+ }
+
+ static inline int
+ btfloat4fastcmp(Datum x, Datum y, SortSupport ssup)
+ {
+ float4 arg1 = DatumGetFloat4(x);
+ float4 arg2 = DatumGetFloat4(y);
+
+ return float4_cmp_internal(arg1, arg2);
+ }
+
+ static inline int
+ btfloat8fastcmp(Datum x, Datum y, SortSupport ssup)
+ {
+ float8 arg1 = DatumGetFloat8(x);
+ float8 arg2 = DatumGetFloat8(y);
+
+ return float8_cmp_internal(arg1, arg2);
+ }
+
+ #else
+
+ extern int btint4fastcmp(Datum x, Datum y, SortSupport ssup);
+ extern int btint8fastcmp(Datum x, Datum y, SortSupport ssup);
+ extern int float4_cmp_internal(float4 a, float4 b);
+ extern int float8_cmp_internal(float8 a, float8 b);
+ extern int btfloat4fastcmp(Datum x, Datum y, SortSupport ssup);
+ extern int btfloat8fastcmp(Datum x, Datum y, SortSupport ssup);
+
+ #endif /* USE_INLINE */
+
/* float.c */
extern PGDLLIMPORT int extra_float_digits;
diff --git a/src/include/utils/sortsupport.h b/src/include/utils/sortsupport.h
new file mode 100644
index ef8d853..eb04f11
*** a/src/include/utils/sortsupport.h
--- b/src/include/utils/sortsupport.h
***************
*** 31,36 ****
--- 31,46 ----
* data can be stored using the ssup_extra field. Any such data
* should be allocated in the ssup_cxt memory context.
*
+ * In addition to providing lower-overhead comparators, this infrastructure also
+ * allows client SortSupport functions to resolve a generic comparator
+ * representation for the type from a static set. These are available in core
+ * for only a subset of built-in sort support functions. However, SortSupport
+ * functions may also go so far as to provide their own full specialization, by
+ * assigning a value to the SortSupport function pointer, typically though not
+ * necessarily by "instantiating" their own qsort specialization using the
+ * facility within template_qsort_arg.h. This enables more aggressive per-type
+ * optimizations at sort time.
+ *
* Note: since pg_amproc functions are indexed by (lefttype, righttype)
* it is possible to associate a BTSORTSUPPORT function with a cross-type
* comparison. This could sensibly be used to provide a fast comparator
***************
*** 49,54 ****
--- 59,98 ----
#include "access/attnum.h"
+ /*
+ * Which comparator representation can be used for this type? It is acceptable
+ * to use an int4 comparator with the date datatype for example, because both
+ * types have the same underlying representation. In particular, their
+ * comparators are interchangeable.
+ *
+ * It is incorrect to do this with types that share a certain bitwise
+ * representation with some scalar type, represented by an enum constant here,
+ * when they do not have interchangeable comparators. For example, sortkeys of
+ * legacy float8 representation of timestamps will not be set to
+ * TYPE_COMP_FLOAT8, because its comparator has special handling of NaN values.
+ *
+ * It's not useful to do this with types that have inherently high-overhead
+ * comparisons, of greater than a few instructions; the cost of the comparison
+ * itself is expected to dominate, marginalizing any benefit.
+ *
+ * SortSupport client functions can set their comparator representation to
+ * TYPE_COMP_FULL_SPECIALISATION, in which case they must also set the
+ * qsort_arg_spec function pointer, to a function that fulfills the interface of
+ * a sorting specialization, typically by "instantiating" a sorting
+ * specialization using the infrastructure within template_qsort_arg.h, but,
+ * conceivably, by generalizing from the example of that infrastructure to
+ * produce a novel, custom sort function for the type.
+ */
+ typedef enum TypeCompar
+ {
+ TYPE_COMP_OTHER,
+ TYPE_COMP_INT4,
+ TYPE_COMP_INT8,
+ TYPE_COMP_FLOAT4,
+ TYPE_COMP_FLOAT8,
+ TYPE_COMP_FULL_SPECIALISATION
+ } TypeCompar;
+
typedef struct SortSupportData *SortSupport;
typedef struct SortSupportData
*************** typedef struct SortSupportData
*** 96,103 ****
int (*comparator) (Datum x, Datum y, SortSupport ssup);
/*
! * Additional sort-acceleration functions might be added here later.
*/
} SortSupportData;
--- 140,168 ----
int (*comparator) (Datum x, Datum y, SortSupport ssup);
/*
! * This specialization function pointer is sometimes used as an alternative
! * to the standard qsort_arg, when it has been determined that we can
! * benefit from various optimizations specific to the type and to the
! * number of sortkeys.
! *
! * Often, this will simply entail using a variant that inlines the comparator
! * previously called (through a pointer) by the highly generic qsort_arg -
! * a comparator such as comparetup_heap, which encapsulates the details of
! * comparing the class of tuple in question and is where the call to
! * MySortSupport->comparator is made.
! *
! * In some other cases, particularly with scalar datatypes that are assumed
! * to be sorted far more frequently in practice, the specialization goes so
! * far as to inline the comparator proper from within the tuple-class
! * encapsulating comparator.
! */
! void (*qsort_arg_spec)(void *a, size_t n, size_t es, void *arg);
!
! /*
! * Which type-specific comparator proper (which is part of a fully inlined
! * specialization) can be safely used, if any?
*/
+ TypeCompar usable_compar;
} SortSupportData;
diff --git a/src/include/utils/template_qsort_arg.h b/src/include/utils/template_qsort_arg.h
new file mode 100644
index ...28b6958
*** a/src/include/utils/template_qsort_arg.h
--- b/src/include/utils/template_qsort_arg.h
***************
*** 0 ****
--- 1,322 ----
+ /*-------------------------------------------------------------------------
+ * template_qsort_arg.h: "template" version of qsort_arg.c
+ *
+ * This version of qsort_arg is exclusively used within tuplesort.c to more
+ * efficiently sort common types such as integers and floats. In providing this
+ * version, we seek to take advantage of compile-time optimizations, in
+ * particular, the inlining of comparators. In some cases the "tuple class"
+ * encapsulating comparator (that is, the particular comparator used directly
+ * here) is just inlined, when a full specialization is unavailable. In other
+ * cases, the comparator proper is also inlined, for full integration of the
+ * sort routine. There are variant specializations for cases with only a single
+ * sortkey, and cases with multiple sortkeys.
+ *
+ * The TEMPLATE_QSORT_ARG() macro generates an inlining variant (for sorts with
+ * a single sortKey) and non-inlining variant for sorts with multiple sortKeys.
+ *
+ * We rely on various function declarations, and indeed partially duplicate
+ * some code from tuplesort.c, so this file should be considered private to
+ * that module, rather than a generic piece of infrastructure. However, it is
+ * reasonable for clients to use this infrastructure to generate their own
+ * sorting specializations outside of that translation unit.
+ *
+ * Clients of this code may wish to use it to generate their own full sorting
+ * specialization, before setting a pointer to the specialization within a
+ * custom sortsupport function.
+ *
+ * CAUTION: if you change this file, see also qsort_arg.c as well as qsort.c.
+ * qsort_arg.c should be considered authoritative.
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/template_qsort_arg.h
+ *-------------------------------------------------------------------------
+ */
+ #include "c.h"
+
+ #define swapcode(TYPE, parmi, parmj, n) \
+ do { \
+ size_t i = (n) / sizeof (TYPE); \
+ TYPE *pi = (TYPE *)(void *)(parmi); \
+ TYPE *pj = (TYPE *)(void *)(parmj); \
+ do { \
+ TYPE t = *pi; \
+ *pi++ = *pj; \
+ *pj++ = t; \
+ } while (--i > 0); \
+ } while (0)
+
+ #define SWAPINIT(a, es) swaptype = ((char *)(a) - (char *)0) % sizeof(long) || \
+ (es) % sizeof(long) ? 2 : (es) == sizeof(long)? 0 : 1;
+
+ #define thistype_vecswap(a, b, n) \
+ if ((n) > 0) inl_swapfunc((a), (b), (size_t)(n), swaptype)
+
+ #define thistype_swap(a, b) \
+ if (swaptype == 0) { \
+ long t = *(long *)(void *)(a); \
+ *(long *)(void *)(a) = *(long *)(void *)(b); \
+ *(long *)(void *)(b) = t; \
+ } else \
+ inl_swapfunc(a, b, es, swaptype)
+
+ inline static void
+ inl_swapfunc(char *a, char *b, size_t n, int swaptype)
+ {
+ if (swaptype <= 1)
+ swapcode(long, a, b, n);
+ else
+ swapcode(char, a, b, n);
+ }
+
+ /*
+ * This macro manufactures a type-specific implementation of qsort_arg with
+ * the comparator, COMPAR, known at compile time. COMPAR is typically an
+ * inline function.
+ *
+ * COMPAR should take as its arguments two Datums, and return an int, in
+ * line with standard qsort convention.
+ *
+ * We have void* parameters for TYPE##comparetup_inline just to shut up the compiler.
+ * They could be SortTuple pointers instead, but that would make it more
+ * difficult to keep template_qsort_arg.h consistent with tuplesort.c.
+ */
+
+ #define DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, SPEC_VAR, ADD_CODE, OUT_COMPAR) \
+ void TYPE##_##SPEC_VAR##_qsort_arg(void *a, size_t n, size_t es, void *arg); \
+ \
+ inline static int32 \
+ TYPE##SPEC_VAR##AppFunc(Datum datum1, bool isnull1, Datum datum2, bool isnull2, \
+ SortSupport sortKey) \
+ { \
+ int32 compare; \
+ if (isnull1) \
+ { \
+ if (isnull2) \
+ compare = 0; /* NULL "=" NULL */ \
+ else if (sortKey->ssup_nulls_first) \
+ compare = -1; /* NULL "<" NOT_NULL */ \
+ else \
+ compare = 1; /* NULL ">" NOT_NULL */ \
+ } \
+ else if (isnull2) \
+ { \
+ if (sortKey->ssup_nulls_first) \
+ compare = 1; /* NOT_NULL ">" NULL */ \
+ else \
+ compare = -1; /* NOT_NULL "<" NULL */ \
+ } \
+ else \
+ { \
+ compare = COMPAR(datum1, datum2); \
+ \
+ if (sortKey->ssup_reverse) \
+ compare = -compare; \
+ } \
+ return compare; \
+ } \
+ \
+ /* \
+ * Note that this is heavily based on comparetup_inline; the two should be kept \
+ * consistent. \
+ */ \
+ pg_always_inline static int \
+ TYPE##SPEC_VAR##comparetup_inline(const void *a, const void *b, \
+ Tuplesortstate *state) \
+ { \
+ const SortTuple* aT = a; \
+ const SortTuple* bT = b; \
+ int32 compare; \
+ SortSupport sortKey = state->sortKeys; \
+ \
+ /* Allow interrupting long sorts */ \
+ CHECK_FOR_INTERRUPTS(); \
+ compare = TYPE##SPEC_VAR##AppFunc(aT->datum1, aT->isnull1, bT->datum1, \
+ bT->isnull1, sortKey); \
+ if (compare != 0) \
+ return compare; \
+ /* Additional code for variants with more than one sortkey */ \
+ ADD_CODE \
+ return 0; \
+ } \
+ \
+ inline static char * \
+ TYPE##SPEC_VAR##med3(char *a, char *b, char *c, void *arg) \
+ { \
+ return OUT_COMPAR(a, b, arg) < 0 ? \
+ (OUT_COMPAR(b, c, arg) < 0 ? \
+ b : (OUT_COMPAR(a, c, arg) < 0 ? c : a)) \
+ : (OUT_COMPAR(b, c, arg) > 0 ? \
+ b : (OUT_COMPAR(a, c, arg) < 0 ? a : c)); \
+ } \
+ \
+ void \
+ TYPE##_##SPEC_VAR##_qsort_arg(void *a, size_t n, size_t es, void *arg) \
+ { \
+ char *pa, \
+ *pb, \
+ *pc, \
+ *pd, \
+ *pl, \
+ *pm, \
+ *pn; \
+ int d, \
+ r, \
+ swaptype, \
+ presorted; \
+ \
+ loop:SWAPINIT(a, es); \
+ if (n < 7) \
+ { \
+ for (pm = (char *) a + es; pm < (char *) a + n * es; pm += es) \
+ for (pl = pm; pl > (char *) a && \
+ OUT_COMPAR(pl - es, pl, arg) > 0; \
+ pl -= es) \
+ thistype_swap(pl, pl - es); \
+ return; \
+ } \
+ presorted = 1; \
+ for (pm = (char *) a + es; pm < (char *) a + n * es; pm += es) \
+ { \
+ if (OUT_COMPAR(pm - es, pm, arg) > 0) \
+ { \
+ presorted = 0; \
+ break; \
+ } \
+ } \
+ if (presorted) \
+ return; \
+ pm = (char *) a + (n / 2) * es; \
+ if (n > 7) \
+ { \
+ pl = (char *) a; \
+ pn = (char *) a + (n - 1) * es; \
+ if (n > 40) \
+ { \
+ d = (n / 8) * es; \
+ pl = TYPE##SPEC_VAR##med3(pl, pl + d, pl + 2 * d, arg); \
+ pm = TYPE##SPEC_VAR##med3(pm - d, pm, pm + d, arg); \
+ pn = TYPE##SPEC_VAR##med3(pn - 2 * d, pn - d, pn, arg); \
+ } \
+ pm = TYPE##SPEC_VAR##med3(pl, pm, pn, arg); \
+ } \
+ thistype_swap(a, pm); \
+ pa = pb = (char *) a + es; \
+ pc = pd = (char *) a + (n - 1) * es; \
+ for (;;) \
+ { \
+ while (pb <= pc && \
+ (r = OUT_COMPAR(pb, a, arg)) <= 0) \
+ { \
+ if (r == 0) \
+ { \
+ thistype_swap(pa, pb); \
+ pa += es; \
+ } \
+ pb += es; \
+ } \
+ while (pb <= pc && \
+ (r = OUT_COMPAR(pc, a, arg)) >= 0) \
+ { \
+ if (r == 0) \
+ { \
+ thistype_swap(pc, pd); \
+ pd -= es; \
+ } \
+ pc -= es; \
+ } \
+ if (pb > pc) \
+ break; \
+ thistype_swap(pb, pc); \
+ pb += es; \
+ pc -= es; \
+ } \
+ pn = (char *) a + n * es; \
+ r = Min(pa - (char *) a, pb - pa); \
+ thistype_vecswap(a, pb - r, r); \
+ r = Min(pd - pc, pn - pd - es); \
+ thistype_vecswap(pb, pn - r, r); \
+ if ((r = pb - pa) > es) \
+ TYPE##_##SPEC_VAR##_qsort_arg(a, r / es, es, arg); \
+ if ((r = pd - pc) > es) \
+ { \
+ /* Iterate rather than recurse to save stack space */ \
+ a = pn - r; \
+ n = r / es; \
+ goto loop; \
+ } \
+ }
+
+ /*
+ * This code becomes part of the comparator meta-function for the "reg"
+ * specialization variant of each datatype-specific specialization.
+ *
+ * Note that this is heavily based on tuplesort_comparetup_heap; the two should
+ * be kept consistent.
+ *
+ * We can handle multiple sortKeys, but the function generally will not be
+ * inlined directly when sorting. We'll try and use compile time knowledge of
+ * the comparator for later sortKeys, on the basis that that has been determined
+ * to be a win frequently enough to justify the overhead.
+ *
+ * Modern compilers are not inclined to pay too much attention to the inline
+ * keyword, and indeed inline at the callsite granularity, rather than the
+ * function granularity. It should not be assumed that the second call to the
+ * "inline" comparator here will result in a second copy of the comparator.
+ */
+
+ #define MULT_ADDITIONAL_CODE(COMPAR) \
+ { \
+ HeapTupleData ltup; \
+ HeapTupleData rtup; \
+ TupleDesc tupDesc; \
+ int nkey; \
+ \
+ /* Compare additional sort keys */ \
+ ltup.t_len = \
+ ((MinimalTuple) aT->tuple)->t_len + MINIMAL_TUPLE_OFFSET; \
+ ltup.t_data = \
+ (HeapTupleHeader) ((char *) aT->tuple - MINIMAL_TUPLE_OFFSET); \
+ rtup.t_len = \
+ ((MinimalTuple) bT->tuple)->t_len + MINIMAL_TUPLE_OFFSET; \
+ rtup.t_data = \
+ (HeapTupleHeader) ((char *) bT->tuple - MINIMAL_TUPLE_OFFSET); \
+ tupDesc = state->tupDesc; \
+ sortKey++; \
+ for (nkey = 1; nkey < state->nKeys; nkey++, sortKey++) \
+ { \
+ AttrNumber attno = sortKey->ssup_attno; \
+ Datum datum1, \
+ datum2; \
+ bool isnull1, \
+ isnull2; \
+ \
+ datum1 = heap_getattr(&ltup, attno, tupDesc, &isnull1); \
+ datum2 = heap_getattr(&rtup, attno, tupDesc, &isnull2); \
+ \
+ if (sortKey->usable_compar == state->sortKeys->usable_compar) \
+ compare = COMPAR(datum1, isnull1, \
+ datum2, isnull2, \
+ sortKey); \
+ else \
+ compare = ApplySortComparator(datum1, isnull1, \
+ datum2, isnull2, \
+ sortKey); \
+ if (compare != 0) \
+ return compare; \
+ } \
+ }
+
+ #define SING_ADDITIONAL_CODE Assert(state->nKeys == 1);
+
+ /*
+ * Manufacture inlining variant for nKeys=1 case, and non-inlining variant
+ * for nKeys > 1 case
+ */
+ #define TEMPLATE_QSORT_ARG_HEAP(TYPE, COMPAR) \
+ DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, inlheap, \
+ SING_ADDITIONAL_CODE, TYPE##inlheapcomparetup_inline) \
+ DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, regheap, \
+ MULT_ADDITIONAL_CODE(TYPE##regheapAppFunc), \
+ TYPE##regheapcomparetup_inline)
diff --git a/src/port/qsort.c b/src/port/qsort.c
new file mode 100644
index 8e2c6d9..d1981d6
*** a/src/port/qsort.c
--- b/src/port/qsort.c
***************
*** 7,13 ****
* Remove ill-considered "swap_cnt" switch to insertion sort,
* in favor of a simple check for presorted input.
*
! * CAUTION: if you change this file, see also qsort_arg.c
*
* src/port/qsort.c
*/
--- 7,14 ----
* Remove ill-considered "swap_cnt" switch to insertion sort,
* in favor of a simple check for presorted input.
*
! * CAUTION: if you change this file, see also qsort_arg.c and
! * template_qsort_arg.h
*
* src/port/qsort.c
*/
diff --git a/src/port/qsort_arg.c b/src/port/qsort_arg.c
new file mode 100644
index 28d1894..0ab6198
*** a/src/port/qsort_arg.c
--- b/src/port/qsort_arg.c
***************
*** 7,13 ****
* Remove ill-considered "swap_cnt" switch to insertion sort,
* in favor of a simple check for presorted input.
*
! * CAUTION: if you change this file, see also qsort.c
*
* src/port/qsort_arg.c
*/
--- 7,13 ----
* Remove ill-considered "swap_cnt" switch to insertion sort,
* in favor of a simple check for presorted input.
*
! * CAUTION: if you change this file, see also qsort.c and template_qsort_arg.h
*
* src/port/qsort_arg.c
*/
On Fri, Jan 27, 2012 at 3:33 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
Patch is attached. I have not changed the duplicate functions. This is
because I concluded that it was the lesser of two evils to have to get
the template to generate both declarations in the header file, and
definitions in the .c file - that seemed particularly obscure. We're
never going to have to expose/duplicate any more comparators anyway.
Do you agree?
Not really. You don't really need macros to generate the prototypes;
you could just write them out longhand.
I think there's a mess of naming confusion in here, though, as perhaps
best illustrated by this macro definition:
#define TEMPLATE_QSORT_ARG_HEAP(TYPE, COMPAR) \
DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, inlheap, \
SING_ADDITIONAL_CODE, TYPE##inlheapcomparetup_inline) \
DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, regheap, \
MULT_ADDITIONAL_CODE(TYPE##regheapAppFunc), \
TYPE##regheapcomparetup_inline)
The idea here is that when we have only a single sort key, we include
SING_ADDITIONAL_CODE in the tuple comparison function, whereas when we
have more than one, we instead include MULT_ADDITIONAL_CODE. Right
there, I think we have a naming problem, because abbreviating "single"
to "sing" and multiple to "mult" is less than entirely clear. For a
minute or two I was trying to figure out whether our sorting code was
musically inclined, and I'm a native English speaker. But then we
switch to another set of terminology completely for the generated
functions: inlheap for the single-key case, and regheap for the
multiple-key case. I find that even more confusing.
I think we ought to get rid of this:
+typedef enum TypeCompar
+{
+ TYPE_COMP_OTHER,
+ TYPE_COMP_INT4,
+ TYPE_COMP_INT8,
+ TYPE_COMP_FLOAT4,
+ TYPE_COMP_FLOAT8,
+ TYPE_COMP_FULL_SPECIALISATION
+} TypeCompar;
Instead, just modify SortSupportData to have two function pointers as
members, one for the single-key case and another for the multiple-key
case, and have the sortsupport functions initialize them to the actual
functions that should be called. The layer of indirection, AFAICS,
serves no purpose.
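To make that concrete, a rough sketch of that shape (field names invented
for illustration; the real SortSupportData has many more members than this,
and this is not code from the patch):

#include <stddef.h>

typedef void (*qsort_arg_spec_fn) (void *a, size_t n, size_t es, void *arg);

typedef struct SortSupportSketch
{
    /* specialized sort for the single-sortkey case, or NULL if none */
    qsort_arg_spec_fn qsort_arg_single;

    /* specialized sort for the multiple-sortkey case, or NULL if none */
    qsort_arg_spec_fn qsort_arg_multi;
} SortSupportSketch;

The sortsupport function fills in whichever pointers apply, and
tuplesort_performsort() just picks the one matching state->nKeys, falling
back on the generic qsort_arg when it is NULL - no enum, no switch.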
It's pretty easy to remove a specialisation at any time - just remove
less than 10 lines of code. It's also pretty difficult to determine,
with everyone's absolute confidence, where the right balance lies.
Perhaps the sensible thing to do is to not be so conservative in what
we initially commit, while clearly acknowledging that we may not have
the balance right, and that it may have to change. We then have the
entire beta part of the cycle in which to decide to roll back from
that position, without any plausible downside. If, on the other hand,
we conservatively lean towards fewer specialisations in the initial
commit, no one will complain about the lack of an improvement in
performance that they never had.
Eh, really? Typically when we do something good, the wolves are
howling at the door to make it work in more cases.
I think that possibly the one remaining blocker to tentatively
committing this with all specialisations intact is that I haven't
tested this on Windows, as I don't currently have access to a Windows
development environment. I have set one up before, but it's a huge
pain. Can anyone help me out?
This doesn't strike me as terribly OS-dependent, unless by that we
mean compiler-dependent.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 31 January 2012 19:47, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jan 27, 2012 at 3:33 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
Patch is attached. I have not changed the duplicate functions. This is
because I concluded that it was the lesser of two evils to have to get
the template to generate both declarations in the header file, and
definitions in the .c file - that seemed particularly obscure. We're
never going to have to expose/duplicate any more comparators anyway.
Do you agree?
Not really. You don't really need macros to generate the prototypes;
you could just write them out longhand.
Well, since that would have the advantage of forcing the client of
those macros to explicitly acknowledge what specialisations they were
getting in order to get a warning-free build, that might be a better
approach (though uncalled static functions are discarded anyway).
However, I don't really see how you expect me to make the
comparator-proper functions within nbtcompare.c inline without some
duplication. How can I have inline functions in that .c file, while
still having those functions callable from other TUs, while avoiding
duplication/moving the functions? I suppose that I could do some trick
with macros, but once again I believe that the cure is worse than the
disease - is it really such a big deal to duplicate a handful of
trivial functions, given that we're never going to have to expand that
number?
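For what it's worth, the shape being weighed is just the usual header idiom,
which the builtins.h hunk above already follows; a minimal sketch with
illustrative names (assuming a USE_INLINE-style build knob) is:

/* fastcmp.h - sketch only, not the actual patch */
#ifdef USE_INLINE
/* every includer gets its own inlinable copy of the definition */
static inline int
my_int4_fastcmp(int a, int b)
{
    return (a > b) - (a < b);
}
#else
/* out-of-line fallback: exactly one .c file must provide this definition */
extern int my_int4_fastcmp(int a, int b);
#endif

The duplication in question is that out-of-line copy, which the non-inlining
build still needs somewhere in a .c file - the handful of trivial functions
I mentioned.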
I think there's a mess of naming confusion in here, though, as perhaps
best illustrated by this macro definition:
#define TEMPLATE_QSORT_ARG_HEAP(TYPE, COMPAR) \
DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, inlheap, \
SING_ADDITIONAL_CODE, TYPE##inlheapcomparetup_inline) \
DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, regheap, \
MULT_ADDITIONAL_CODE(TYPE##regheapAppFunc), \
TYPE##regheapcomparetup_inline)
The idea here is that when we have only a single sort key, we include
SING_ADDITIONAL_CODE in the tuple comparison function, whereas when we
have more than one, we instead include MULT_ADDITIONAL_CODE. Right
there, I think we have a naming problem, because abbreviating "single"
to "sing" and multiple to "mult" is less than entirely clear. For a
minute or two I was trying to figure out whether our sorting code was
musically inclined, and I'm a native English speaker. But then we
switch to another set of terminology completely for the generated
functions: inlheap for the single-key case, and regheap for the
multiple-key case. I find that even more confusing.
I'm happy to clean that up, though I think that when you try and
browbeat C into the role of supporting generic programming, confusing
code is more or less inevitable.
I think we ought to get rid of this:
+typedef enum TypeCompar
+{
+ TYPE_COMP_OTHER,
+ TYPE_COMP_INT4,
+ TYPE_COMP_INT8,
+ TYPE_COMP_FLOAT4,
+ TYPE_COMP_FLOAT8,
+ TYPE_COMP_FULL_SPECIALISATION
+} TypeCompar;
Instead, just modify SortSupportData to have two function pointers as
members, one for the single-key case and another for the multiple-key
case, and have the sortsupport functions initialize them to the actual
functions that should be called. The layer of indirection, AFAICS,
serves no purpose.
It anticipates a time when the system will have both btree and heap
variants of the specialisations. It also serves to usefully document
the intent of the optimisation - the most direct interface isn't
necessarily the best. That said, I really am looking for the path of
least resistance towards getting this committed at this stage, so if
you still think it's a bad idea, I won't complain if you modify the
patch to have multiple function pointers as you described.
It's pretty easy to remove a specialisation at any time - just remove
less than 10 lines of code. It's also pretty difficult to determine,
with everyone's absolute confidence, where the right balance lies.
Perhaps the sensible thing to do is to not be so conservative in what
we initially commit, while clearly acknowledging that we may not have
the balance right, and that it may have to change. We then have the
entire beta part of the cycle in which to decide to roll back from
that position, without any plausible downside. If, on the other hand,
we conservatively lean towards fewer specialisations in the initial
commit, no one will complain about the lack of an improvement in
performance that they never had.
Eh, really? Typically when we do something good, the wolves are
howling at the door to make it work in more cases.
Well, if anyone was complaining about a regression, because the
optimisation represented a net loss for their reasonable use-case,
then clearly we wouldn't be doing well, so we'd have to reconsider our
position, and no sensible wolf is going to argue with that. Obviously
it's not possible to prove that there couldn't be some kind of
regression under some set of circumstances in a water-tight fashion.
However, I believe that I have argued well that on balance, it's worth
maintaining the full set of specialisations. This is one case where
committing code and working out if there are problems makes sense,
because: 1) There very probably aren't (you yourself were not
surprised that a regression couldn't be demonstrated), and 2) if there
are, we can withdraw from having so many specialisations in small
increments, rather than having to throw everything out as might have
been the case in the past with other proposed performance patches.
I think that possibly the one remaining blocker to tentatively
committing this with all specialisations intact is that I haven't
tested this on Windows, as I don't currently have access to a Windows
development environment. I have set one up before, but it's a huge
pain. Can anyone help me out?
This doesn't strike me as terribly OS-dependent, unless by that we
mean compiler-dependent.
What concerns me is how the pg_always_inline macro behaves when
building with MSVC. While I have no reason to doubt that the effect
should be similar there, I would like to have that verified. Building
with MSVC is simple if you're only building release source packages,
but when you bring Flex + Bison into the equation, it becomes rather
difficult, which is why I asked for help there. You are generally
right to say that it is not terribly OS-dependent though, since, as
I've already shown, very portable software like the GNU C library uses
tricks like pg_always_inline extensively, both in code intended for a
particular arch, and code intended for all. Even where
pg_always_inline doesn't have an underlying compiler-specific way of
compelling inlining (it just falls back on inline, which is probably
now a no-op on this obscure platform anyway), that's just improving
performance. The usefulness of this optimisation does not hinge on
pg_always_inline in any way.
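If it would help, a sketch of how the c.h fallback chain might be extended
to cover MSVC explicitly (untested, for the reasons above; MSVC defines
_MSC_VER and spells its hint __forceinline):

#if defined(__always_inline)
#define pg_always_inline __always_inline
#elif defined(__GNUC__)
#define pg_always_inline inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#define pg_always_inline __forceinline
#else
#define pg_always_inline inline
#endif

That could sit alongside, or replace, the existing __force_inline test;
either way, falling back to plain inline only ever costs performance, not
correctness.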
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Jan 26, 2012, at 9:32 PM, Robert Haas wrote:
But if we want to put it on a diet, the first thing I'd probably be
inclined to lose is the float4 specialization. Some members of the
audience will recall that I take dim view of floating point arithmetic
generally, but I'll concede that there are valid reasons for using
float8. I have a harder time coming up with a good reason to use
float4 - ever, for anything you care about. So I would be inclined to
think that if we want to trim this back a bit, maybe that's the one to
let go. If we want to be even more aggressive, the next thing I'd
probably lose is the optimization of multiple sortkey cases, on the
theory that single sort keys are probably by far the most common
practical case.
I do find float4 to be useful, though it's possible that my understanding is flawed…
We end up using float to represent ratios in our database; things that really, honest to God do NOT need to be exact.
In most cases, 7 digits of precision (which AFAIK is what you're guaranteed with float4) is plenty, so we use float4 rather than bloat the database (though, since we're on 64bit hardware I guess that distinction is somewhat moot…).
Is there something I'm missing that would make float4 useless as compared to float8?
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
Excerpts from Jim Nasby's message of mié feb 01 19:12:58 -0300 2012:
On Jan 26, 2012, at 9:32 PM, Robert Haas wrote:
But if we want to put it on a diet, the first thing I'd probably be
inclined to lose is the float4 specialization. Some members of the
audience will recall that I take dim view of floating point arithmetic
generally, but I'll concede that there are valid reasons for using
float8. I have a harder time coming up with a good reason to use
float4 - ever, for anything you care about. So I would be inclined to
think that if we want to trim this back a bit, maybe that's the one to
let go. If we want to be even more aggressive, the next thing I'd
probably lose is the optimization of multiple sortkey cases, on the
theory that single sort keys are probably by far the most common
practical case.
I do find float4 to be useful, though it's possible that my understanding is flawed…
We end up using float to represent ratios in our database; things that really, honest to God do NOT need to be exact.
But then, do you sort using those ratios?
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, Feb 01, 2012 at 04:12:58PM -0600, Jim Nasby wrote:
On Jan 26, 2012, at 9:32 PM, Robert Haas wrote:
But if we want to put it on a diet, the first thing I'd probably be
inclined to lose is the float4 specialization. Some members of the
audience will recall that I take dim view of floating point arithmetic
generally, but I'll concede that there are valid reasons for using
float8. I have a harder time coming up with a good reason to use
float4 - ever, for anything you care about. So I would be inclined to
think that if we want to trim this back a bit, maybe that's the one to
let go. If we want to be even more aggressive, the next thing I'd
probably lose is the optimization of multiple sortkey cases, on the
theory that single sort keys are probably by far the most common
practical case.
I do find float4 to be useful, though it's possible that my understanding is flawed…
We end up using float to represent ratios in our database; things that really, honest to God do NOT need to be exact.
In most cases, 7 digits of precision (which AFAIK is what you're guaranteed with float4) is plenty, so we use float4 rather than bloat the database (though, since we're on 64bit hardware I guess that distinction is somewhat moot…).
Is there something I'm missing that would make float4 useless as compared to float8?
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
If the values stored are float4, it would be nice to have that fast-path
sort available too. The cases where I have used float4 values in the past,
I absolutely did not need any of the float8 baggage and in my case, using
the actual float4 comparison operator resulted in a significant time savings
over the normal float8. This could be processor specific, but it would be
worth testing before throwing it out.
Regards,
Ken
On Fri, Jan 27, 2012 at 09:37:37AM -0500, Robert Haas wrote:
On Fri, Jan 27, 2012 at 9:27 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
Well, I don't think it's all that subjective - it's more the case that
it is just difficult, or it gets that way as you consider more
specialisations.
Sure it's subjective. Two well-meaning people could have different
opinions without either of them being "wrong". If you do a lot of
small, in-memory sorts, more of this stuff is going to seem worthwhile
than if you don't.
As for what types/specialisations may not make the cut, I'm
increasingly convinced that floats (in the following order: float4,
float8) should be the first to go. Aside from the fact that we cannot
use their specialisations for anything like dates and timestamps,
floats are just way less useful than integers in the context of
database applications, or at least those that I've been involved with.
As important as floats are in the broad context of computing, it's
usually only acceptable to store data in a database as floats within
scientific applications, and only then when their limitations are
well-understood and acceptable. I think we've all heard anecdotes at
one time or another, involving their limitations not being well
understood.
While we're waiting for anyone else to weigh in with an opinion on the
right place to draw the line here, do you want to post an updated
patch with the changes previously discussed?
Well, I think we have to ask not only how many people are using
float4/8, but how many people are sorting or creating indexes on them.
I think it would be few and perhaps should be eliminated.
Peter Geoghegan obviously has done some serious work in improving
sorting, and worked well with the community process. He has done enough
analysis that I am hard-pressed to see how we would get similar
improvement using a different method, so I think it comes down to
whether we want the 28% speedup by adding 55k (1%) to the binary.
I think Peter has shown us how to get that, and what it will cost --- we
just need to decide now whether it is worth it. What I am saying is
there probably isn't a cheaper way to get that speedup, either now or in
the next few years. (COPY might need similar help for speedups.)
I believe this is a big win and well worth the increased binary size
because the speed up is significant, and because it is of general
usefulness for a wide range of queries. Either of these would be enough
to justify the additional 1% size, but both make it an easy decision for
me.
FYI, I believe COPY needs similar optimizations; we have gotten repeated
complaints about its performance and this method of optimization might
also be our only option.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Mon, Feb 06, 2012 at 04:19:07PM -0500, Bruce Momjian wrote:
Peter Geoghegan obviously has done some serious work in improving
sorting, and worked well with the community process. He has done enough
analysis that I am hard-pressed to see how we would get similar
improvement using a different method, so I think it comes down to
whether we want the 28% speedup by adding 55k (1%) to the binary.
I think Peter has shown us how to get that, and what it will cost --- we
just need to decide now whether it is worth it. What I am saying is
there probably isn't a cheaper way to get that speedup, either now or in
the next few years. (COPY might need similar help for speedups.)
I believe this is a big win and well worth the increased binary size
because the speed up is significant, and because it is of general
usefulness for a wide range of queries. Either of these would be enough
to justify the additional 1% size, but both make it an easy decision for
me.
FYI, I believe COPY needs similar optimizations; we have gotten repeated
complaints about its performance and this method of optimization might
also be our only option.
Sorry, that was wordy. What I am saying is that years ago we did
hot-spot optimization for storage and tuple access using macros. We
have looked for cheap optimizations for sort (and COPY) for years, but
haven't found it. If we want optimizations in these areas, we are
probably going to need to do the same sort of hot-spot optimization we
did earlier, and Peter's work is an excellent candidate for that.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On 2/6/12 3:19 PM, Bruce Momjian wrote:
While we're waiting for anyone else to weigh in with an opinion on the
right place to draw the line here, do you want to post an updated
patch with the changes previously discussed?
Well, I think we have to ask not only how many people are using
float4/8, but how many people are sorting or creating indexes on them.
I think it would be few and perhaps should be eliminated.
I agree that it's probably pretty unusual to index floats. My objection
was on the assumption that float8 is valid but float4 isn't. If we are
going to provide a fast-path for one then we should do it for both if
for no other reason than least surprise.
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On 6 February 2012 21:19, Bruce Momjian <bruce@momjian.us> wrote:
Peter Geoghegan obviously has done some serious work in improving
sorting, and worked well with the community process.
Thank you for acknowledging that.
It's unfortunate that C does not support expressing these kinds of
ideas in a more natural way.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Mon, Feb 06, 2012 at 10:49:10PM +0000, Peter Geoghegan wrote:
On 6 February 2012 21:19, Bruce Momjian <bruce@momjian.us> wrote:
Peter Geoghegan obviously has done some serious work in improving
sorting, and worked well with the community process.
Thank you for acknowledging that.
It's unfortunate that C does not support expressing these kinds of
ideas in a more natural way.
Yes, it is a problem, and a benefit. We have avoided C++ because these
types of trade-offs that we are discussing are often done invisibly, so
we can't make the decision ourselves.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Mon, Feb 06, 2012 at 06:43:04PM -0500, Bruce Momjian wrote:
On Mon, Feb 06, 2012 at 10:49:10PM +0000, Peter Geoghegan wrote:
On 6 February 2012 21:19, Bruce Momjian <bruce@momjian.us> wrote:
Peter Geoghegan obviously has done some serious work in improving
sorting, and worked well with the community process.
Thank you for acknowledging that.
It's unfortunate that C does not support expressing these kinds of
ideas in a more natural way.
Yes, it is a problem, and a benefit. We have avoided C++ because these
types of trade-offs that we are discussing are often done invisibly, so
we can't make the decision ourselves.
Let me add that while it is fine for languages like C++ to make these
decisions for application code automatically, operating systems and
high-performance databases developers prefer to make such decisions
explicitly, which is what we are discussing now.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Jim "Decibel!" Nasby wrote:
I agree that it's probably pretty unusual to index floats.
FWIW: Cubes and points are floats, right? So would spatial indexes benefit
from this optimization, or is it only raw floats?
Jay Levitt
On Tue, Feb 7, 2012 at 2:39 PM, Jay Levitt <jay.levitt@gmail.com> wrote:
Jim "Decibel!" Nasby wrote:
I agree that it's probably pretty unusual to index floats.
FWIW: Cubes and points are floats, right? So would spatial indexes benefit
from this optimization, or is it only raw floats?
Cubes are not floats, nor are points.
In general, what's being proposed here is to make a separate copy of
quicksort for each of N datatypes, with the comparator function
inlined. In order for this to benefit multiple types, they'd have to
use the same set of machine instructions to compare values on disk.
So in general you can make this apply to as many datatypes as you
want: it's just that each one will require another copy of the
quicksort code in the binary. The more complex the comparator is, the
less you'll save, because the savings presumably come largely from
things like saving and restoring registers and pushing/popping stack
frames, and that's really only going to be material if the underlying
comparator is pretty darn cheap.
Actually, to be more precise, the patch is proposing to create TWO
copies of the quicksort code for each of N datatypes, one for the case
where there is but a single sort key and the other for the case where
there are multiple sort keys. Having a separate code for the single
sort key case saves more because it lets you get rid of a control loop
- you don't have to iterate down the list of keys, because there's
only one.
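A stripped-down illustration of that difference (simplified types and
invented names - this is not code from the patch):

typedef int (*cmp_fn) (int a, int b);

/* generic form: loop over all sort keys, reaching the comparator via a pointer */
static int
compare_generic(const int *akeys, const int *bkeys, int nkeys, cmp_fn cmp)
{
    int     i;

    for (i = 0; i < nkeys; i++)
    {
        int     c = cmp(akeys[i], bkeys[i]);

        if (c != 0)
            return c;
    }
    return 0;
}

/* single-key specialization: no loop, no indirect call, and the comparison inlines */
static inline int
compare_one_int_key(int akey, int bkey)
{
    return (akey > bkey) - (akey < bkey);
}

As noted, the single-key copy additionally drops the control loop, which is
why it is the cheaper and more profitable of the two.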
I've been skeptical of this patch for a number of reasons, the major
one of which is that most workloads spend only a very small amount of
time in doing qucksorts, and most quicksorts are of very small amounts
of data and therefore fast anyway. It is easy to construct an
artificial test case that does lots and lots of in-memory sorting, but
in real life I think that's not the great part of what people use
databases for. The time gets spent on things like groveling through
big tables or doing joins, and then maybe we sort the rows at the end,
but that's not where most of the time goes. It may be rightly pointed
out, of course, that a penny saved is a penny earned: the fact that
most people don't spend much time quicksorting doesn't mean that we
shouldn't try to make quicksorting as fast as possible. But there are
a couple of additional considerations.
First, the code is, at present, uglier than I would like. I mean to
spend some time trying to clean that up, but there are 102 patches in
this CommitFest (the number seems to keep slowly growing, despite the
deadline being three weeks gone) and this isn't the only one I care
about, so I haven't quite gotten there yet. But hopefully soon.
Second, there's a concern about binary bloat: duplicating lots of code
with different comparators inlined generates, well, a lot of machine
code. Of course, an 0.8% increase in the size of the resulting binary
is very unlikely to produce any measurable performance regression on
its own, but that's not really the issue. The issue is rather that we
could easily go nuts applying this technique in lots of different
places - or even just in one place to lots of different types. Let's
say that we find that even for types with very complex comparators, we
can get a 5% speedup on quick-sorting speed using this technique.
Should we, then, apply the technique to every type we have? Perhaps
the binary would grow by, I don't know, 10%. Maybe that's still not
enough to start hurting performance (or making packagers complain),
but surely it's getting there, and what happens when we discover that
similar speedups are possible using the same trick in five other
scenarios? I think it's a bit like going out to dinner and spending
$50. It's nicer than eating at home, and many people can afford to do
it occasionally, but if you do it every night, it gets expensive (and
possibly fattening).
So we need some principled way of deciding how much inlining is
reasonable, because I am 100% certain this is not going to be the last
time someone discovers that a massive exercise in inlining can yield a
nifty performance benefit in some case or another: index builds and
COPY have already been mentioned on this thread, and I expect that
people will find other cases as well. I'm not really sure what the
"budget" is - i.e. how much binary bloat we can afford to add - or how
many cases there are that can benefit, but the first isn't infinite
and the second is more than the first.
Having said all that, I am inclined to commit at least some portion of
this, but I wanted to knock off a few other patches that have been
lingering for a while first.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Feb 07, 2012 at 09:38:39PM -0500, Robert Haas wrote:
So we need some principled way of deciding how much inlining is
reasonable, because I am 100% certain this is not going to be the last
time someone discovers that a massive exercise in inlining can yield a
nifty performance benefit in some case or another: index builds and
COPY have already been mentioned on this thread, and I expect that
people will find other cases as well. I'm not really sure what the
"budget" is - i.e. how much binary bloat we can afford to add - or how
many cases there are that can benefit, but the first isn't infinite
and the second is more than the first.Having said all that, I am inclined to commit at least some portion of
this, but I wanted to knock off a few other patches that have been
lingering for a while first.
One approach would be to only do a few types now, e.g. integers and
strings, and perhaps leave the others for later.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
It doesn't necessarily matter if we increase the size of the postgres
binary by 10%, precisely because most of that is not going to be in
play from one instant to the next. I'm thinking, in particular, of
btree index specialisations, where it could make perfect sense to "go
crazy". You cannot have a reasonable discussion about such costs
without considering that they will perhaps never be paid, given any
plausible workload. That's why the percentage that the postgres binary
size has been shown to increase by really isn't pertinent at all. At
best, it's a weak proxy for such costs, assuming you don't have a
demonscene-like preoccupation with reducing binary size, and I don't
believe that we should.
It would be difficult for me to measure such things objectively, but
I'd speculate that the proprietary databases have much larger binaries
than ours, while having far fewer features, precisely because they
started applying tricks like this a long time ago. You could counter
that their code bases probably look terrible, and you'd have a point,
but so do I.
On 8 February 2012 02:38, Robert Haas <robertmhaas@gmail.com> wrote:
I've been skeptical of this patch for a number of reasons, the major
one of which is that most workloads spend only a very small amount of
time in doing qucksorts, and most quicksorts are of very small amounts
of data and therefore fast anyway. It is easy to construct an
artificial test case that does lots and lots of in-memory sorting, but
in real life I think that's not the great part of what people use
databases for.
Fair enough, but if that's true, then it's also true that the cost due
to cache marginalisation - the only cost that I think is worth
considering at all - is correspondingly a small fraction of that very
small amount of sorting.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Peter Geoghegan <peter@2ndquadrant.com> writes:
It doesn't necessarily matter if we increase the size of the postgres
binary by 10%, precisely because most of that is not going to be in
play from one instant to the next.
You've heard of swapping, no? Basically what I'm hearing from you is
total denial that binary bloat costs anything, and that just does not
square with reality. Even if the costs in performance terms are
negligible in many common situations (something that you've asserted
but without offering any proof), there are other costs; software
distribution for instance is definitely sensitive to size. As a Red Hat
person I've had to spend time fighting to keep Postgres included in the
minimum "does it fit on a DVD" distribution of RHEL/Fedora. That case
gets harder to make every year, and yet it's pretty critical to the
success of the project --- if you don't get distributed, you lose users.
IMO this patch is already well past the point of diminishing returns in
value-per-byte-added. I'd like to see it trimmed back to provide a fast
path for just single-column int4/int8/float4/float8 sorts. The other
cases aren't going to offer enough of a win to justify the code space.
regards, tom lane
On Wed, Feb 8, 2012 at 8:33 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
It doesn't necessarily matter if we increase the size of the postgres
binary by 10%, precisely because most of that is not going to be in
play from one instant to the next.
As Tom says, that doesn't jive with my experience. If you add on
enough binary bloat, you will have more page faults. It's true (as I
think you pointed out upthread) that packing all the copies of
quicksort into the binary one after the other minimizes the effect on
other things, but it doesn't eliminate it entirely. If you've got
this:
<random other stuff> ... <a zillion copies of quicksort> .... <more other stuff>
...then a branch from the "random other stuff" section of the binary
to the "more other stuff" section of the binary may cost more. For
example, suppose the OS does X amount of readahead. By stuffing all
those copies of quicksort into the middle there, you increase the
chances that the page you need was beyond the readahead window. Or,
if it wasn't going to be in the readahead window either way, then you
increase the distance that the disk head needs to move to find the
required block.
These costs are very small, no question about it. They are almost
impossible to measure individually, in much the same way that the cost
of pollution or greenhouse gas emissions is difficult to measure. But
it's an error to assume that because the costs are individually small
that they will never add up to anything material. As long as you keep
insisting on that, it's hard to have a rational conversation. We can
reasonably debate the magnitude of the costs, but to assert that they
don't exist gets us nowhere. Suppose we could get a 10% speedup on
sin(numeric) by adding 40GB to the binary size. Would you be in favor
of that? Do you think it would hurt performance on any other
workloads? Would our packagers complain at all? Surely your same
argument would apply to that case in spades: anyone who is not using
the gigantic hard-coded lookup table will not pay any portion of the
cost of it.
It would be difficult for me to measure such things objectively, but
I'd speculate that the proprietary databases have much larger binaries
than ours, while having far fewer features, precisely because they
started applying tricks like this a long time ago. You could counter
that their code bases probably look terrible, and you'd have a point,
but so do I.
That might be true; I have no idea. There are probably lots of
reasons why their code bases look terrible, including a long history
of backward compatibility with now-defunct versions, a variety of
commercial pressures, and the fact that they don't have to take flak
in the public square for what their code looks like.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
IMO this patch is already well past the point of diminishing returns in
value-per-byte-added. I'd like to see it trimmed back to provide a fast
path for just single-column int4/int8/float4/float8 sorts. The other
cases aren't going to offer enough of a win to justify the code space.
I'm curious about how much we're gaining from the single-column
specializations vs. the type-specific specializations. I think I'm
going to go try to characterize that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Feb 08, 2012 at 01:33:30PM +0000, Peter Geoghegan wrote:
It doesn't necessarily matter if we increase the size of the postgres
binary by 10%, precisely because most of that is not going to be in
play from one instant to the next. I'm thinking, in particular, of
btree index specialisations, where it could make perfect sense to "go
crazy". You cannot have a reasonable discussion about such costs
without considering that they will perhaps never be paid, given any
plausible workload. That's why the percentage that the postgres binary
size has been shown to increase by really isn't pertinent at all. At
best, it's a weak proxy for such costs, assuming you don't have a
demonscene-like preoccupation with reducing binary size, and I don't
believe that we should.
When you start a binary, your executable is mapped to a file system
binary, and you page-fault in the pages you need. Now, if your
optimization was alone in its own 4k (x86) virtual pages, and you never
called the functions, odds are you would not pay a penalty, aside from
distribution penalty, and perhaps a small penalty if useful code was
before and after your block.
The sort code expansion, however, is done in-line, in the middle of the
sort code, so you are clearly filling in 64-byte (x86) CPU cache lines
with type-specific expansion code for every sort case, whether we use
the code or not. Now, I don't think it is a big problem, and I think
the speedup is worth it for common data types, but we can't say the cost
is zero.
Saying it another way, having a binary in your file system that you
never call is not overhead except for storage, but in this case, the
sort code expansion is inside critical functions we are already calling.
Frankly, it is this cost that has kept us away from using such
optimizations for a long time. I recently posted that the zero-cost
optimizations are mostly completed for sort and COPY, and we have to
start considering non-zero-cost optimizations --- sad, but true.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Wed, Feb 08, 2012 at 10:17:36AM -0500, Robert Haas wrote:
On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
IMO this patch is already well past the point of diminishing returns in
value-per-byte-added. I'd like to see it trimmed back to provide a fast
path for just single-column int4/int8/float4/float8 sorts. The other
cases aren't going to offer enough of a win to justify the code space.
I'm curious about how much we're gaining from the single-column
specializations vs. the type-specific specializations. I think I'm
going to go try to characterize that.
Yes, please. That would be a big help. Is there no optimization for
strings? I assume they are sorted a lot.
We can always add more data types later.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Bruce Momjian <bruce@momjian.us> writes:
Yes, please. That would be a big help. Is there no optimization for
strings? I assume they are sorted a lot.
It seems unlikely that it'd be worth including strings, especially if
your locale is not C. This whole thing only makes sense for datatypes
that are comparable in approximately 1 machine instruction.
regards, tom lane
On Wed, Feb 08, 2012 at 11:35:46AM -0500, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Yes, please. That would be a big help. Is there no optimization for
strings? I assume they are sorted a lot.
It seems unlikely that it'd be worth including strings, especially if
your locale is not C. This whole thing only makes sense for datatypes
that are comparable in approximately 1 machine instruction.
Ah, OK, interesting.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On 8 February 2012 15:17, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
IMO this patch is already well past the point of diminishing returns in
value-per-byte-added. I'd like to see it trimmed back to provide a fast
path for just single-column int4/int8/float4/float8 sorts. The other
cases aren't going to offer enough of a win to justify the code space.
I'm curious about how much we're gaining from the single-column
specializations vs. the type-specific specializations. I think I'm
going to go try to characterize that.
I think it might make more sense to lose the type-specific
specialisations for the multi-key case while adding a generic
multi-key specialisation, than to lose all multi-key specialisations,
though I have not considered that question at length, and would think
that we'd still want to keep an int4 version in that case. Note that I
*did not* include a generic multi-key specialisation, though only
because I saw little point, having already covered by far the most
common cases.
While you're at it, I'd like to suggest that you perform a benchmark
on a multi-key specialisation, so we can see just what we're throwing
away before we do so. Better to have those numbers come from you.
I continue to maintain that the most appropriate course of action is
to provisionally commit all specialisations. If it's hard to know what
effect this is going to have on real workloads, let's defer to beta
testers, who presumably try the new release out with their
application. It's a question you could squarely put to them, without
gradually rolling back from that initial position being much of a
problem.
The mysql-server package is 45 MB on Fedora 16. That 1% of Postgres
binary figure is for my earlier patch with btree specialisations,
right? I'm not asking you to look at that right now. I also don't
think that "where do we eventually draw the line with specialisations
like this in Postgres generally?" is a question that you should expect
me to answer, though I will say that we should look at each case on
its merits.
I have not "totally denied" binary bloat costs. I have attempted to
quantify them, while acknowledging that such a task is difficult, as
was evident from the fact that Robert "wasn't surprised" that I could
not demonstrate any regression. Granted, my definition of a regression
is that there is very clearly no net loss in performance at some
reasonable granularity, which is a very practical definition. You can
quite easily contrive a case that HOT handles really badly. Some
people did, I believe, but HOT won out because it was clearly very
useful in the real world.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Wed, Feb 8, 2012 at 10:59 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Feb 08, 2012 at 10:17:36AM -0500, Robert Haas wrote:
On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
IMO this patch is already well past the point of diminishing returns in
value-per-byte-added. I'd like to see it trimmed back to provide a fast
path for just single-column int4/int8/float4/float8 sorts. The other
cases aren't going to offer enough of a win to justify the code space.
I'm curious about how much we're gaining from the single-column
specializations vs. the type-specific specializations. I think I'm
going to go try to characterize that.
Yes, please. That would be a big help. Is there no optimization for
strings? I assume they are sorted a lot.
They are, but the value of the optimization is expected to diminish
for more complex comparators.
I did a quick test of this using the same setup I mentioned upthread:
create table nodups (g) as select (g%10)*1000+g/10 from
generate_series(1,10000) g;
This was the actual test query:
select * from nodups order by g offset 10001;
I did a ten-second warmup after starting each postmaster, followed
by three one-minute test runs:
master:
tps = 298.824321 (including connections establishing)
tps = 301.741619 (including connections establishing)
tps = 302.238016 (including connections establishing)
Peter's patch, but with the type-specific optimizations disabled, so
using just the single-sortkey optimizations:
tps = 357.553690 (including connections establishing)
tps = 358.765875 (including connections establishing)
tps = 358.825051 (including connections establishing)
Peter's patch, intact:
tps = 394.566994 (including connections establishing)
tps = 394.228667 (including connections establishing)
tps = 394.492677 (including connections establishing)
So that's about a 19% speedup for just the single sort-key
optimizations, and about a 30% speedup in total. Compared to a
baseline that includes the single sort-key optimizations, the
additional type-specific optimization is buying about 10% on this
test.
I tried changing the test query to return the data, like this:
select * from nodups order by g;
master:
tps = 181.301129 (including connections establishing)
tps = 180.348333 (excluding connections establishing)
tps = 179.600825 (including connections establishing)
Peter's patch, but with the type-specific optimizations disabled, so
using just the single-sortkey optimizations:
tps = 201.728269 (including connections establishing)
tps = 198.022527 (including connections establishing)
tps = 199.663082 (including connections establishing)
Peter's patch, intact:
tps = 214.037387 (including connections establishing)
tps = 212.895145 (including connections establishing)
tps = 211.596558 (including connections establishing)
So, with the overhead of actually returning the sorted data, we get
10% from the generic single-sortkey optimizations and 18% from the two
combined. Compared to a baseline that includes the single-sortkey
optimizations, the incremental improvement from adding the
type-specific optimizations is about 6.6%. It would be less on a
wider table, presumably.
I also tested a two-column sort:
create table twocol (a, b) as select g%101, g%17 from
generate_series(1,10000) g;
select * from twocol order by a, b offset 10000;
master:
tps = 270.218604 (including connections establishing)
tps = 265.634964 (including connections establishing)
tps = 268.520805 (including connections establishing)
Peter's patch, intact:
tps = 285.528626 (including connections establishing)
tps = 285.140345 (including connections establishing)
tps = 282.388457 (including connections establishing)
So here the type-specific optimizations are buying us even less: only
about 6%, and that's with no client overhead.
And I tested float8:
create table f8 (g) as select random() from generate_series(1,10000);
select * from f8 order by g offset 10000;
master:
tps = 228.373239 (including connections establishing)
tps = 225.117704 (including connections establishing)
tps = 225.196197 (including connections establishing)
Peter's patch, but with the type-specific optimizations disabled, so
using just the single-sortkey optimizations:
tps = 263.060820 (including connections establishing)
tps = 262.788926 (including connections establishing)
tps = 263.041186 (including connections establishing)
Peter's patch, intact:
tps = 283.274797 (including connections establishing)
tps = 280.597965 (including connections establishing)
tps = 280.151349 (including connections establishing)
That's +17% for the single-sortkey stuff, +25% for everything, and the
incremental improvement of the type-specific optimizations over the
generic single-key optimizations is about 6.7%.
It seems clear that the single sort-key optimizations are a much
better value per byte of code than the type-specific optimizations.
Ignoring client overhead, we get between half and two-thirds of the
benefit, and it costs us just one extra copy of the quicksort logic
for all data types. The type-specific optimizations only apply to a
subset of the available data types, produce a lot more machine code
(one extra copy of quicksort per type), and the absolute performance
benefit is less. My gut feeling is that the few extra percentage
points we can get by including lots of extra copies of quicksort is
not worth it. There are probably many cases where we can squeeze a
few extra points out of inlining (Tom and I have discussed a few of
them in the past, in particular one related to index-only scans) and I
don't have any confidence that this is the most important one, or even
in the top ten. If we were talking about speeding up COPY or index
lookups, I might well feel differently, but very few people are going
to quicksort 10,000 rows on a sufficiently regular basis to justify
doing this. To even measure the benefit of this, you have to set up a
totally artificial test case (such as the above), and you have to run
it via pgbench, because if you just run it interactively, the
run-to-run variation swamps the actual improvement, and without doing
a statistical analysis of the results you can't even be sure whether
anything's changed. I just can't get excited about that. However, I
find the single-key optimizations much more compelling, for the
reasons stated above, and feel we ought to include those.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
[ lots of numbers ]
... I just can't get excited about that. However, I
find the single-key optimizations much more compelling, for the
reasons stated above, and feel we ought to include those.
This conclusion seems sound to me, for the reasons you stated and one
more: optimizations for a single sort key are going to be applicable
to a very wide variety of queries, whereas all the other cases are
necessarily less widely applicable.
regards, tom lane
On 8 February 2012 17:58, Robert Haas <robertmhaas@gmail.com> wrote:
It seems clear that the single sort-key optimizations are a much
better value per byte of code than the type-specific optimizations.
Ignoring client overhead, we get between half and two-thirds of the
benefit, and it costs us just one extra copy of the quicksort logic
for all data types.
That was clear from an early stage, and is something that I
acknowledged way back in September - it's pretty obvious that chasing
diminishing returns is the nature of this sort of thing, which is
fine, provided that they are not chased beyond some point that doesn't
make sense. However, I do not see why we should not at least include
full integer specialisations as well (for single-keys at least), given
that we get the additional, not inconsiderable improvements. I do not
accept the premise that we need to find the optimal bytes added to
performance increase ratio, because, as I've already stressed, adding
bytes to the application binary does not have a linear cost.
Here's what might be a good compromise:
singlekey generic, singlekey int4, singlekey int8, multikey generic.
That is a total of 4 contiguous copies of the qsort function, because
we've obsoleted the original qsort_arg (except for btree index
sorting, which likely matters far less). We get the most benefit for 5
very common types - int4, int8, date, timestamp and timestamptz. A 30%
improvement has to be better than the 19% speedup for just the single
sort-key optimisations, considering that in practice these types are
very commonly used.
I think that there may be additional benefits from making the
qsort_arg specialisation look less like a c stdlib one, like refining
the swap logic to have compile-time knowledge of the type it is
sorting. I'm thinking that we could usefully trim quite a bit from
this:
#define SWAPINIT(a, es) swaptype = ((char *)(a) - (char *)0) % sizeof(long) || \
(es) % sizeof(long) ? 2 : (es) == sizeof(long)? 0 : 1;
#define thistype_vecswap(a, b, n) \
if ((n) > 0) inl_swapfunc((a), (b), (size_t)(n), swaptype)
#define thistype_swap(a, b) \
if (swaptype == 0) { \
long t = *(long *)(void *)(a); \
*(long *)(void *)(a) = *(long *)(void *)(b); \
*(long *)(void *)(b) = t; \
} else \
inl_swapfunc(a, b, es, swaptype)
inline static void
inl_swapfunc(char *a, char *b, size_t n, int swaptype)
{
if (swaptype <= 1)
swapcode(long, a, b, n);
else
swapcode(char, a, b, n);
}
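To illustrate what I mean - this is only a sketch, not something in the
patch - once the specialisation knows at compile time that it is sorting
SortTuple elements, all of the SWAPINIT/swaptype machinery above could
collapse to a plain struct assignment:

/*
 * Hypothetical SortTuple-specific swap (illustration only): with the
 * element type fixed at compile time there is no need for the byte-wise
 * swapcode loop or the runtime swaptype selection; the compiler can
 * lower this to a handful of register moves.
 */
static inline void
sorttuple_swap(SortTuple *a, SortTuple *b)
{
    SortTuple   t = *a;

    *a = *b;
    *b = t;
}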
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On 8 February 2012 18:48, Peter Geoghegan <peter@2ndquadrant.com> wrote:
I think that there may be additional benefits from making the
qsort_arg specialisation look less like a c stdlib one, like refining
the swap logic to have compile-time knowledge of the type it is
sorting. I'm thinking that we could usefully trim quite a bit from
this:
It seems that while we cannot get any better performance - no doubt
because the compiler is already applying all manner of optimisations
here, perhaps including constant propagation - we can simplify this
part of the code quite a bit. That seems fairly incidental, though.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Wed, Feb 8, 2012 at 1:48 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
That was clear from an early stage, and is something that I
acknowledged way back in September
OK, so why didn't/don't we do and commit that part first, and then
proceed to argue about the remainder once it's in?
I think that there may be additional benefits from making the
qsort_arg specialisation look less like a c stdlib one, like refining
the swap logic to have compile-time knowledge of the type it is
sorting. I'm thinking that we could usefully trim quite a bit from
this:
That's an interesting idea, which seems worth pursuing, though
possibly not for 9.2.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 8 February 2012 23:33, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Feb 8, 2012 at 1:48 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
That was clear from an early stage, and is something that I
acknowledged way back in September
OK, so why didn't/don't we do and commit that part first, and then
proceed to argue about the remainder once it's in?
I have no objection to that. I'm not sure that it's going to be
possible to agree to anything beyond what has already been agreed. We
don't seem to be covering new ground.
What would it take to convince you to be more inclusive, and have at
least a few full specialisations, in particular, int4 and int8 single
key full specialisations? I'm sure that you'll agree that they're the
second most compelling case. There doesn't seem to be a standard that
I can meet that you can even /describe/ to have those additional
specialisations committed, which is a shame, since they appear to be
pretty decent wins above and beyond the generic single key
specialisation, all things considered.
I suppose that the standard refrain here is "it's your patch; you must
prove the case for committing it". That would be fine by me, but this
isn't just about me, and it seems to be a practical impossibility in
this case. It may be quite impossible even with far greater resources
than I can afford to apply here, since it seems like you're hinting at
some kind of rigorous proof that this cannot cause a regression for a
single client, even though in order to see that regression multiple
other clients would have to be benefiting. I cannot provide such a
proof, since it would probably have to consider all commercially
available CPUs and all possible workloads - I doubt that anyone can.
That being the case, I'm eager not to waste any more time on this. I
bear no ill will; that just seems to be the situation that we find
ourselves in.
That said, if you can describe the standard, I'll try and meet it -
you seem to be suggesting that months and months of benchmarks
wouldn't really change things, since this is the first step on the
road to binary bloat ruin. The only fundamentally new argument that I
can offer is that by applying the standard that the community does
here, it severely limits the number of performance improvements that
can be added, and the aggregate effect is that we're quite certainly
worse off. It's a pity this isn't a really easy decision for you to
make, but that's the nature of these things, and we should expect that
to increasingly be the case. Is it really so unacceptable for us to
risk doing a little worse under some rare circumstances in order to do
better under some common circumstances? Why should it be an easy
decision - when are important decisions ever easy?
I think that there may be additional benefits from making the
qsort_arg specialisation look less like a c stdlib one, like refining
the swap logic to have compile-time knowledge of the type it is
sorting. I'm thinking that we could usefully trim quite a bit from
this:
That's an interesting idea, which seems worth pursuing, though
possibly not for 9.2.
Well, it's really just a code clean-up. I suspect that qsort_arg is a
minimally modified version of the NetBSD one that wasn't necessarily
thoroughly understood when it was initially added (not that I'm
suggesting that it ought to have been). Then again, you might prefer
to keep it as consistent as possible with qsort (the other copy of
that sort function that we already have).
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Tue, Feb 07, 2012 at 09:38:39PM -0500, Robert Haas wrote:
Second, there's a concern about binary bloat: duplicating lots of code
with different comparators inlined generates, well, a lot of machine
code. Of course, an 0.8% increase in the size of the resulting binary
is very unlikely to produce any measurable performance regression on
its own, but that's not really the issue. The issue is rather that we
could easily go nuts applying this technique in lots of different
places - or even just in one place to lots of different types. Let's
say that we find that even for types with very complex comparators, we
can get a 5% speedup on quick-sorting speed using this technique.
Should we, then, apply the technique to every type we have? Perhaps
the binary would grow by, I don't know, 10%. Maybe that's still not
enough to start hurting performance (or making packagers complain),
but surely it's getting there, and what happens when we discover that
similar speedups are possible using the same trick in five other
scenarios? I think it's a bit like going out to dinner and spending
$50. It's nicer than eating at home, and many people can afford to do
it occasionally, but if you do it every night, it gets expensive (and
possibly fattening).
So we need some principled way of deciding how much inlining is
reasonable, because I am 100% certain this is not going to be the last
time someone discovers that a massive exercise in inlining can yield a
nifty performance benefit in some case or another: index builds and
COPY have already been mentioned on this thread, and I expect that
people will find other cases as well. I'm not really sure what the
"budget" is - i.e. how much binary bloat we can afford to add - or how
many cases there are that can benefit, but the first isn't infinite
and the second is more than the first.
Having such a metric would resolve this discussion, but formulating it will be
all but impossible. We don't know how much time PostgreSQL installations now
spend sorting or how much additional time they would spend on cache misses,
TLB misses, page faults, etc. That data won't show up on this thread.
You posed[1] the straw man of a sin(numeric) optimization requiring a 40GB
lookup table. I would not feel bad about rejecting that, because it can live
quite comfortably as an external module. Indeed, I think the best thing we
can do to constrain long-term bloat in the "postgres" executable is to improve
pluggability. Let a wider range of features live outside the postmaster
binary. For example, if we had the infrastructure to let hash, GIN, GiST and
SP-GiST index access methods live in their own DSOs (like PL/pgSQL does
today), I would support doing that. Extensibility is a hallmark of
PostgreSQL. It's a bad day when we reject a minority-interest feature or
optimization that has no other way to exist.
Better pluggability can't ease the decision process for Peter's patch, because
its specializations essentially must live in the "postgres" executable to live
at all. Nominally, we could let external modules inject sort specializations
for core types, and Peter could publish an all_fast_sorts extension. That
would be useless punting: if we can't make a principled decision on whether
these accelerated sorts justify 85 KiB of binary, how will a DBA who discovers
the module make that decision?
This patch has gotten more than its fair share of attention for bloat, and I
think that's mostly because there's a dial-a-bloat-level knob sticking out and
begging to be frobbed. On my system, fastpath_sort_2012_01_19.patch adds 85
KiB to the postgres binary. A recent commit added 57 KiB and bloat never came
up in discussions thereof. I appreciate your concern over a slippery slope of
inlining proposals, but I also don't wish to see the day when every patch
needs to declare and justify its binary byte count. Unless, of course, we
discover that elusive metric for evaluating it fairly.
If we'd like to start taking interest in binary bloat, how about having the
buildfarm log capture an "ls -l" on the bin directory? We'll then have data
to mine if we ever wish to really take action.
All that being said, I'd want to see a 15-30% (depending on how contrived)
improvement to a microbenchmark or a 5% improvement to a generic benchmark
(pgbench, DBT-<N>) before adopting an optimization of this complexity. Based
on your measurements[2], per-type inlining gives us a 7% microbenchmark
improvement. I'd therefore excise the per-type inlining. (For the record,
given a 20% improvement to the same benchmark, I'd vote yea on the submission
and perhaps even suggest int2 and uuid support.)
nm
[1]: http://archives.postgresql.org/message-id/CA+TgmoZO1xSz+YiqZ2mRoKMcMqtb+JiR0Lz43CNe6de7--QDAA@mail.gmail.com
[2]: http://archives.postgresql.org/message-id/CA+TgmobFz8SqTGTaaRYEryWUZFODGETprbuYgBhsPLR4+Li6TQ@mail.gmail.com
On Thu, Feb 9, 2012 at 7:24 AM, Noah Misch <noah@leadboat.com> wrote:
On Tue, Feb 07, 2012 at 09:38:39PM -0500, Robert Haas wrote:
Second, there's a concern about binary bloat: duplicating lots of code
with different comparators inlined generates, well, a lot of machine
code. Of course, an 0.8% increase in the size of the resulting binary
is very unlikely to produce any measurable performance regression on
its own, but that's not really the issue. The issue is rather that we
could easily go nuts applying this technique in lots of different
places - or even just in one place to lots of different types. Let's
say that we find that even for types with very complex comparators, we
can get a 5% speedup on quick-sorting speed using this technique.
Should we, then, apply the technique to every type we have? Perhaps
the binary would grow by, I don't know, 10%. Maybe that's still not
enough to start hurting performance (or making packagers complain),
but surely it's getting there, and what happens when we discover that
similar speedups are possible using the same trick in five other
scenarios? I think it's a bit like going out to dinner and spending
$50. It's nicer than eating at home, and many people can afford to do
it occasionally, but if you do it every night, it gets expensive (and
possibly fattening).
So we need some principled way of deciding how much inlining is
reasonable, because I am 100% certain this is not going to be the last
time someone discovers that a massive exercise in inlining can yield a
nifty performance benefit in some case or another: index builds and
COPY have already been mentioned on this thread, and I expect that
people will find other cases as well. I'm not really sure what the
"budget" is - i.e. how much binary bloat we can afford to add - or how
many cases there are that can benefit, but the first isn't infinite
and the second is more than the first.
Having such a metric would resolve this discussion, but formulating it will be
all but impossible. We don't know how much time PostgreSQL installations now
spend sorting or how much additional time they would spend on cache misses,
TLB misses, page faults, etc. That data won't show up on this thread.
You posed[1] the straw man of a sin(numeric) optimization requiring a 40GB
lookup table. I would not feel bad about rejecting that, because it can live
quite comfortably as an external module. Indeed, I think the best thing we
can do to constrain long-term bloat in the "postgres" executable is to improve
pluggability. Let a wider range of features live outside the postmaster
binary. For example, if we had the infrastructure to let hash, GIN, GiST and
SP-GiST index access methods live in their own DSOs (like PL/pgSQL does
today), I would support doing that. Extensibility is a hallmark of
PostgreSQL. It's a bad day when we reject a minority-interest feature or
optimization that has no other way to exist.
Better pluggability can't ease the decision process for Peter's patch, because
its specializations essentially must live in the "postgres" executable to live
at all. Nominally, we could let external modules inject sort specializations
for core types, and Peter could publish an all_fast_sorts extension. That
would be useless punting: if we can't make a principled decision on whether
these accelerated sorts justify 85 KiB of binary, how will a DBA who discovers
the module make that decision?
This patch has gotten more than its fair share of attention for bloat, and I
think that's mostly because there's a dial-a-bloat-level knob sticking out and
begging to be frobbed. On my system, fastpath_sort_2012_01_19.patch adds 85
KiB to the postgres binary. A recent commit added 57 KiB and bloat never came
up in discussions thereof. I appreciate your concern over a slippery slope of
inlining proposals, but I also don't wish to see the day when every patch
needs to declare and justify its binary byte count. Unless, of course, we
discover that elusive metric for evaluating it fairly.
If we'd like to start taking interest in binary bloat, how about having the
buildfarm log capture an "ls -l" on the bin directory? We'll then have data
to mine if we ever wish to really take action.
All that being said, I'd want to see a 15-30% (depending on how contrived)
improvement to a microbenchmark or a 5% improvement to a generic benchmark
(pgbench, DBT-<N>) before adopting an optimization of this complexity. Based
on your measurements[2], per-type inlining gives us a 7% microbenchmark
improvement. I'd therefore excise the per-type inlining. (For the record,
given a 20% improvement to the same benchmark, I'd vote yea on the submission
and perhaps even suggest int2 and uuid support.)
That's all well-said and I mostly agree with all of it, including your
suggested percentages in the last paragraph (but does anyone really
sort int2s or uuids?). What I think is different about inlining as
compared to writing code is that it's possible to generate an amount
of binary bloat that is disproportionate to the amount of code you
actually wrote. You can blow up your binary more or less on
auto-pilot, just by adding more macro invocations. Of course, the
overall impact is the same, but when you have to actually write the
code by hand, you're more likely to notice that you're cranking out
hundreds of lines of code of dubious real-world value.
I think the other thing to keep in mind is that, inevitably, people
decide what considerations to worry about with reference to a
particular patch based on the perceived value of the feature. I
thought about the fact that Tom's parameterized-path patch adds a lot
of new code and to be honest I'm not totally sanguine about it, but on
the other hand it has the potential to make certain kinds of queries
that were seriously slow in previous releases run hundreds or
thousands of times faster, and maybe open the door to further planner
optimizations in the future. When you start talking about multiples
with lots of zeros on the end, binary bloat fades out of the picture:
you COULD assess it, but we already pretty much know what the outcome
of that discussion will be. Similarly, when you hear Tom complaining
about the cost of adding a new grammar keyword, that's a tip-off that
he either thinks that no effort whatsoever was made to find a syntax
that will work with the existing keywords, or he doesn't think the
feature is worth very much, because if you look at git log -p
--author=Lane src/include/parser/kwlist.h you'll find he's added them
on various occasions with nary a word spoken. What we need to be wary
of, in my opinion anyway, is not adding binary bloat or keywords per
se, but rather of adding them for trivial gains. And I do consider a
6% gain on a highly artificial macrobenchmark to be trivial. If we
were talking about a 10x speedup, I doubt the word "bloat" would be
anywhere on this thread, and it certainly wouldn't be getting the
prominence that it has here.
I'm also more than slightly concerned that we are losing sight of the
forest for the trees. I have heard reports that sorting large amounts
of data is many TIMES slower for us than for a certain other database
product. I frankly wonder if this entire patch is a distraction.
When inlining is the tool of choice for further performance
improvement, it suggests that we're pretty much out of ideas, and that
whatever we get out of inlining - be it 6% or 30% - will be the end.
I don't like that idea very much.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 9 February 2012 13:50, Robert Haas <robertmhaas@gmail.com> wrote:
I'm also more than slightly concerned that we are losing sight of the
forest for the trees. I have heard reports that sorting large amounts
of data is many TIMES slower for us than for a certain other database
product. I frankly wonder if this entire patch is a distraction.
When inlining is the tool of choice for further performance
improvement, it suggests that we're pretty much out of ideas, and that
whatever we get out of inlining - be it 6% or 30% - will be the end.
I don't like that idea very much.
If you're talking about the product I think you're talking about,
well, their contracts forbid any actual benchmarks, so we won't have
any hard figures, but I find it intuitively difficult to believe that
the difference is that large, mostly because I've demonstrated that
the performance of qsort is a pretty good proxy for the performance of
a query like "select * from foo order by bar", and qsort can't be too
far from optimal. Besides, didn't you yourself say "I've been
skeptical of this patch for a number of reasons, the major one of
which is that most workloads spend only a very small amount of time in
doing quicksorts"?
I have a new plan. I think you'll like it:
* drop the type specific specialisations entirely.
* drop qsort_arg (the original, unspecialised version). We now have
exactly the same number of source code copies of qsort as before: 2.
* gut the comparison of the leading sort key only from
comparetup_heap, comparetup_index_btree, etc (I collectively refer to
these functions as tuple-class encapsulating comparators, distinct
from their comparator proper).
* Instantiate two specialisations of qsort_arg: single key and
multiple key (exactly the same number as has already been agreed to,
since we lost the generic version). However, there is one key
difference now: we call one of the tuple class encapsulating functions
through a function pointer if the first comparison gives equality.
Early indications are that this is almost as much of a win as the
single-key case. So the code effectively does this (but through a
function pointer):
#define MULTI_ADDITIONAL_CODE(COMPAR) \
{ \
return comparetup_heap(aT, bT, state); \
}
This works because most comparisons won't actually need to look at the
second key, and because the cut-down tuple-class encapsulating
comparator that is used in the last revision of the patch doesn't
actually make any assumption about the particular tuple-class
encapsulating comparator in use.
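Roughly speaking - and this is a paraphrase of the generated code, not
the macro-expanded output itself - the instantiated multi-key comparator
boils down to this:

/*
 * Sketch of the instantiated multi-key comparator: the leading-key
 * comparison is inlined, and only when the leading keys compare equal
 * do we fall back to the tuple-class encapsulating comparator through
 * the state->comparetup function pointer.
 */
static inline int
mult_comparetup_sketch(const SortTuple *a, const SortTuple *b,
                       Tuplesortstate *state)
{
    int32       compare;

    compare = ApplySortComparator(a->datum1, a->isnull1,
                                  b->datum1, b->isnull1,
                                  state->sortKeys);
    if (compare != 0)
        return compare;

    /* leading keys equal; hand off the remaining keys */
    return state->comparetup(a, b, state);
}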
It may not even be worth-while having a separate copy of the qsort
function for the multiple key case, so the question of binary bloat
becomes entirely irrelevant, as there would be a net gain of zero
copies of qsort_arg.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Feb 9, 2012 at 9:37 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
On 9 February 2012 13:50, Robert Haas <robertmhaas@gmail.com> wrote:
I'm also more than slightly concerned that we are losing sight of the
forest for the trees. I have heard reports that sorting large amounts
of data is many TIMES slower for us than for a certain other database
product. I frankly wonder if this entire patch is a distraction.
When inlining is the tool of choice for further performance
improvement, it suggests that we're pretty much out of ideas, and that
whatever we get out of inlining - be it 6% or 30% - will be the end.
I don't like that idea very much.
If you're talking about the product I think you're talking about,
well, their contracts forbid any actual benchmarks, so we won't have
any hard figures, but I find it intuitively difficult to believe that
the difference is that large, mostly because I've demonstrated that
the performance of qsort is a pretty good proxy for the performance of
a query like "select * from foo order by bar", and qsort can't be too
far from optimal. Besides, didn't you yourself say "I've been
skeptical of this patch for a number of reasons, the major one of
which is that most workloads spend only a very small amount of time in
doing quicksorts"?
Emphasis on "large" amounts of data.
I've said from the beginning that the small-sort case is less
interesting to me: it's nice to make it faster, but nobody's going to
quit using PostgreSQL because a quicksort takes 8ms instead of 6ms.
It's when we start talking about operations that take minutes or hours
that the differences become material.
I have a new plan. I think you'll like it:
* drop the type specific specialisations entirely.
Check.
* drop qsort_arg (the original, unspecialised version). We now have
exactly the same number of source code copies of qsort as before: 2.
This may be useful elsewhere some day, but I suppose we can always
pull it back out of git if we need it.
* gut the comparison of the leading sort key only from
comparetup_heap, comparetup_index_btree, etc (I collectively refer to
these functions as tuple-class encapsulating comparators, distinct
from their comparator proper).
* Instantiate two specialisations of qsort_arg: single key and
multiple key (exactly the same number as has already been agreed to,
since we lost the generic version). However, there is one key
difference now: we call one of the tuple class encapsulating functions
through a function pointer if the first comparison gives equality.
Early indications are that this is almost as much of a win as the
single-key case. So the code effectively does this (but through a
function pointer):
#define MULTI_ADDITIONAL_CODE(COMPAR) \
{ \
return comparetup_heap(aT, bT, state); \
}
This works because most comparisons won't actually need to look at the
second key, and because the cut-down tuple-class encapsulating
comparator that is used in the last revision of the patch doesn't
actually make any assumption about the particular tuple-class
encapsulating comparator in use.
It may not even be worth-while having a separate copy of the qsort
function for the multiple key case, so the question of binary bloat
becomes entirely irrelevant, as there would be a net gain of zero
copies of qsort_arg.
I'm not sure I entirely follow all this, but I'll look at the code
once you have it. Are you saying that all the comparetup_yadda
functions are redundant to each other in the single-key case?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Feb 09, 2012 at 07:24:49AM -0500, Noah Misch wrote:
This patch has gotten more than its fair share of attention for bloat, and I
think that's mostly because there's a dial-a-bloat-level knob sticking out and
begging to be frobbed.
I already emailed Peter privately stating that he shouldn't feel bad
about the many critiques of his patch --- this dial-a-bloat-level knob
is new for us, and we have to get used to the idea. We are more
discussing the idea of a dial-a-bloat-level knob rather than his patch.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On 9 February 2012 14:51, Robert Haas <robertmhaas@gmail.com> wrote:
I'm not sure I entirely follow all this, but I'll look at the code
once you have it. Are you saying that all the comparetup_yadda
functions are redundant to each other in the single-key case?
Yes, I am. The main reason that the loops exist in those functions
(which is the only way that they substantially differ) is because they
each have to get the other keys through various ways that characterise
the tuple class that they encapsulate (index_getattr(),
heap_getattr(), etc).
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Feb 09, 2012 at 03:36:23PM +0000, Peter Geoghegan wrote:
On 9 February 2012 14:51, Robert Haas <robertmhaas@gmail.com> wrote:
I'm not sure I entirely follow all this, but I'll look at the code
once you have it. Are you saying that all the comparetup_yadda
functions are redundant to each other in the single-key case?
Yes, I am. The main reason that the loops exist in those functions
(which is the only way that they substantially differ) is because they
each have to get the other keys through various ways that characterise
the tuple class that they encapsulate (index_getattr(),
heap_getattr(), etc).
Does this help all types for sorting, including strings?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On 9 February 2012 17:16, Bruce Momjian <bruce@momjian.us> wrote:
Yes, I am. The main reason that the loops exist in those functions
(which is the only way that they substantially differ) is because they
each have to get the other keys through various ways that characterise
the tuple class that they encapsulate (index_getattr(),
heap_getattr(), etc).
Does this help all types for sorting, including strings?
Yes, it does, though of course that's not expected to make too much
difference with text when the C locale isn't used, because the
comparisons are inherently much more expensive, and there isn't a
whole lot we can do about that, at least here.
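To give a sense of the gap: the comparators that benefit here look
roughly like the int4 SortSupport comparator below (a paraphrase, not
the exact source), whereas a non-C-locale text comparison bottoms out in
strcoll() and dwarfs any per-call overhead that these specialisations
remove.

/*
 * Paraphrase of an int4 SortSupport comparator: a couple of integer
 * compares, cheap enough that inlining it into the sort loop can pay
 * off.  Nothing comparable is possible for locale-aware text.
 */
static int
int4_fastcmp_sketch(Datum x, Datum y, SortSupport ssup)
{
    int32       a = DatumGetInt32(x);
    int32       b = DatumGetInt32(y);

    if (a > b)
        return 1;
    if (a < b)
        return -1;
    return 0;
}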
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On 9 February 2012 14:51, Robert Haas <robertmhaas@gmail.com> wrote:
I'm not sure I entirely follow all this, but I'll look at the code
once you have it.
I have attached a revision of the patch, with the adjustments already
described. Note that I have not made this support btree tuplesorting
yet, as there is an impedance mismatch that must be resolved,
particularly with the SortSupport stuff, and I wanted to know what you
think of the multiple key specialisation first. Arguably, we could get
away with only a single specialisation - I haven't really thought about
it much.
You say "Well, how often will we sort 10,000 integers?", and I think
that btree index creation is one very common and useful case, so I'd
like to look at that in more detail. I certainly don't see any reason
to not do it too.
This should give you performance for multiple-key sorts that is
almost as good as the single-key optimisation that you found to be
more compelling. Obviously the need to actually call comparetup_heap
to look at non-leading sortkeys will vary from case to case, and this
is based on your test case, where there are no duplicates and thus no
need to ever do that. That isn't too far from representative, as I
think that in general, a majority of comparisons won't result in
equality.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Attachments:
fastpath_sort_2012_02_10.patch (text/x-patch)
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
new file mode 100644
index 1452e8c..e040ace
*** a/src/backend/utils/sort/tuplesort.c
--- b/src/backend/utils/sort/tuplesort.c
***************
*** 113,118 ****
--- 113,119 ----
#include "utils/pg_rusage.h"
#include "utils/rel.h"
#include "utils/sortsupport.h"
+ #include "utils/template_qsort_arg.h"
#include "utils/tuplesort.h"
*************** struct Tuplesortstate
*** 281,286 ****
--- 282,301 ----
int currentRun;
/*
+ * Will invocations of a tuple-class encapsulating comparator
+ * (comparetup_heap, comparetup_index_btree, etc) skip the leading key,
+ * because that has already been compared elsewhere?
+ */
+ bool skipLeadingkey;
+
+ /*
+ * This specialization function pointer is sometimes used as an alternative
+ * to the standard qsort_arg, when it has been determined that we can
+ * benefit from various per number-of-sortkey performance optimizations.
+ */
+ void (*qsort_arg_spec)(void *a, size_t n, size_t es, void *arg);
+
+ /*
* Unless otherwise noted, all pointer variables below are pointers to
* arrays of length maxTapes, holding per-tape data.
*/
*************** static void readtup_datum(Tuplesortstate
*** 492,497 ****
--- 507,519 ----
static void reversedirection_datum(Tuplesortstate *state);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+ /*
+ * A set of type-neutral specializations, for single sortkey and multiple
+ * sortkey sorts. These specialization are used for sorting both heap and
+ * btree index tuples that meet that criteria.
+ */
+ DO_TEMPLATE_QSORT_ARG(single, ONE_ADDITIONAL_CODE, single_comparetup_inline)
+ DO_TEMPLATE_QSORT_ARG(mult, MULTI_ADDITIONAL_CODE, mult_comparetup_inline)
/*
* tuplesort_begin_xxx
*************** tuplesort_begin_heap(TupleDesc tupDesc,
*** 631,636 ****
--- 653,661 ----
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
+ state->qsort_arg_spec =
+ nkeys==1?single_qsort_arg:mult_qsort_arg;
+ state->skipLeadingkey = true;
MemoryContextSwitchTo(oldcontext);
return state;
*************** tuplesort_performsort(Tuplesortstate *st
*** 1222,1232 ****
* amount of memory. Just qsort 'em and we're done.
*/
if (state->memtupcount > 1)
! qsort_arg((void *) state->memtuples,
! state->memtupcount,
! sizeof(SortTuple),
! (qsort_arg_comparator) state->comparetup,
! (void *) state);
state->current = 0;
state->eof_reached = false;
state->markpos_offset = 0;
--- 1247,1273 ----
* amount of memory. Just qsort 'em and we're done.
*/
if (state->memtupcount > 1)
! {
! /* Use a sorting specialization if available */
! if (state->qsort_arg_spec)
! /* specialization available */
! state->qsort_arg_spec(
! (void *) state->memtuples,
! state->memtupcount,
! sizeof(SortTuple),
! (void *) state);
! else
! /*
! * Fall back on regular qsort_arg, with function pointer
! * comparator, making no compile-time assumptions about the
! * number of sortkeys or the datatype(s) to be sorted.
! */
! qsort_arg((void *) state->memtuples,
! state->memtupcount,
! sizeof(SortTuple),
! (qsort_arg_comparator) state->comparetup,
! (void *) state);
! }
state->current = 0;
state->eof_reached = false;
state->markpos_offset = 0;
*************** make_bounded_heap(Tuplesortstate *state)
*** 2407,2412 ****
--- 2448,2454 ----
REVERSEDIRECTION(state);
state->memtupcount = 0; /* make the heap empty */
+ state->skipLeadingkey = false; /* Not using inlining optimization */
for (i = 0; i < tupcount; i++)
{
if (state->memtupcount >= state->bound &&
*************** comparetup_heap(const SortTuple *a, cons
*** 2660,2674 ****
int nkey;
int32 compare;
! /* Allow interrupting long sorts */
! CHECK_FOR_INTERRUPTS();
!
! /* Compare the leading sort key */
! compare = ApplySortComparator(a->datum1, a->isnull1,
! b->datum1, b->isnull1,
! sortKey);
! if (compare != 0)
! return compare;
/* Compare additional sort keys */
ltup.t_len = ((MinimalTuple) a->tuple)->t_len + MINIMAL_TUPLE_OFFSET;
--- 2702,2716 ----
int nkey;
int32 compare;
! if (!state->skipLeadingkey)
! {
! /* Compare the leading sort key */
! compare = ApplySortComparator(a->datum1, a->isnull1,
! b->datum1, b->isnull1,
! sortKey);
! if (compare != 0)
! return compare;
! }
/* Compare additional sort keys */
ltup.t_len = ((MinimalTuple) a->tuple)->t_len + MINIMAL_TUPLE_OFFSET;
*************** comparetup_index_btree(const SortTuple *
*** 2981,2990 ****
Tuplesortstate *state)
{
/*
! * This is similar to _bt_tuplecompare(), but we have already done the
! * index_getattr calls for the first column, and we need to keep track of
! * whether any null fields are present. Also see the special treatment
! * for equal keys at the end.
*/
ScanKey scanKey = state->indexScanKey;
IndexTuple tuple1;
--- 3023,3031 ----
Tuplesortstate *state)
{
/*
! * We have already done the index_getattr calls for the first column, and we
! * need to keep track of whether any null fields are present. Also see the
! * special treatment for equal keys at the end.
*/
ScanKey scanKey = state->indexScanKey;
IndexTuple tuple1;
diff --git a/src/include/c.h b/src/include/c.h
new file mode 100644
index 7396adb..939f49a
*** a/src/include/c.h
--- b/src/include/c.h
*************** extern int fdatasync(int fildes);
*** 850,853 ****
--- 850,865 ----
/* /port compatibility functions */
#include "port.h"
+ /*
+ * Define a cross-platform "always-inline" macro. This is a very sharp tool that
+ * should be used judiciously.
+ */
+ #ifdef __always_inline
+ #define pg_always_inline __always_inline
+ #elif defined(__force_inline)
+ #define pg_always_inline __force_inline
+ #else
+ #define pg_always_inline inline
+ #endif
+
#endif /* C_H */
diff --git a/src/include/utils/sortsupport.h b/src/include/utils/sortsupport.h
new file mode 100644
index ef8d853..37b3e80
*** a/src/include/utils/sortsupport.h
--- b/src/include/utils/sortsupport.h
*************** typedef struct SortSupportData
*** 94,103 ****
* INT_MIN, as callers are allowed to negate the result before using it.
*/
int (*comparator) (Datum x, Datum y, SortSupport ssup);
-
- /*
- * Additional sort-acceleration functions might be added here later.
- */
} SortSupportData;
--- 94,99 ----
diff --git a/src/include/utils/template_qsort_arg.h b/src/include/utils/template_qsort_arg.h
new file mode 100644
index ...9ede3f8
*** a/src/include/utils/template_qsort_arg.h
--- b/src/include/utils/template_qsort_arg.h
***************
*** 0 ****
--- 1,219 ----
+ /*-------------------------------------------------------------------------
+ * template_qsort_arg.h: "template" version of qsort_arg.c
+ *
+ * This version of qsort_arg is exclusively used within tuplesort.c to more
+ * efficiently sort common types such as integers and floats. In providing this
+ * version, we seek to take advantage of compile-time optimizations, in
+ * particular, the inlining of comparators.
+ *
+ * The TEMPLATE_QSORT_ARG() macro generates an inlining variant (for sorts with
+ * a single sortKey) and non-inlining variant for sorts with multiple sortKeys.
+ *
+ * We rely on various function declarations, and indeed partially duplicate
+ * some code from tuplesort.c, so this file should be considered private to
+ * that module, rather than a generic piece of infrastructure. However, it is
+ * reasonable for clients to use this infrastructure to generate their own
+ * sorting specializations outside of that translation unit.
+ *
+ * CAUTION: if you change this file, see also qsort_arg.c as well as qsort.c.
+ * qsort_arg.c should be considered authoritative.
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/template_qsort_arg.h
+ *-------------------------------------------------------------------------
+ */
+ #include "c.h"
+
+ #define swapcode(TYPE, parmi, parmj, n) \
+ do { \
+ size_t i = (n) / sizeof (TYPE); \
+ TYPE *pi = (TYPE *)(void *)(parmi); \
+ TYPE *pj = (TYPE *)(void *)(parmj); \
+ do { \
+ TYPE t = *pi; \
+ *pi++ = *pj; \
+ *pj++ = t; \
+ } while (--i > 0); \
+ } while (0)
+
+ #define SWAPINIT(a, es) swaptype = ((char *)(a) - (char *)0) % sizeof(long) || \
+ (es) % sizeof(long) ? 2 : (es) == sizeof(long)? 0 : 1;
+
+ #define thistype_vecswap(a, b, n) if ((n) > 0) swapfunc((a), (b), (size_t)(n), swaptype)
+
+ #define thistype_swap(a, b) \
+ if (swaptype == 0) { \
+ long t = *(long *)(void *)(a); \
+ *(long *)(void *)(a) = *(long *)(void *)(b); \
+ *(long *)(void *)(b) = t; \
+ } else \
+ swapfunc(a, b, es, swaptype)
+
+ pg_always_inline static void
+ swapfunc(char *a, char *b, size_t n, int swaptype)
+ {
+ if (swaptype <= 1)
+ swapcode(long, a, b, n);
+ else
+ swapcode(char, a, b, n);
+ }
+
+ /*
+ * This macro manufactures a type-specific implementation of qsort_arg with
+ * the comparator, COMPAR, known at compile time. COMPAR is typically an
+ * inline function.
+ *
+ * COMPAR should take as its arguments two Datums, and return an int, in
+ * line with standard qsort convention.
+ *
+ * We have void* parameters for comparetup_inline just to shut up the compiler.
+ * They could be SortTuple pointers instead, but that would make it more
+ * difficult to keep template_qsort_arg.h consistent with tuplesort.c.
+ */
+
+ #define DO_TEMPLATE_QSORT_ARG(SPEC_VAR, ADD_CODE, OUT_COMPAR) \
+ void SPEC_VAR##_qsort_arg(void *a, size_t n, size_t es, void *arg); \
+ \
+ /* \
+ * Note that this is heavily based on comparetup_inline; the two should be kept \
+ * consistent. \
+ */ \
+ pg_always_inline static int \
+ SPEC_VAR##_comparetup_inline(const void *a, const void *b, \
+ Tuplesortstate *state) \
+ { \
+ const SortTuple* aT = a; \
+ const SortTuple* bT = b; \
+ int32 compare; \
+ SortSupport sortKey = state->sortKeys; \
+ \
+ /* Compare the leading sort key */ \
+ compare = ApplySortComparator(aT->datum1, aT->isnull1, \
+ bT->datum1, bT->isnull1, \
+ sortKey); \
+ if (compare != 0) \
+ return compare; \
+ /* Additional code (varies according to number of sortKeys) */ \
+ ADD_CODE \
+ return 0; \
+ } \
+ \
+ pg_always_inline static char * \
+ SPEC_VAR##med3(char *a, char *b, char *c, void *arg) \
+ { \
+ return OUT_COMPAR(a, b, arg) < 0 ? \
+ (OUT_COMPAR(b, c, arg) < 0 ? \
+ b : (OUT_COMPAR(a, c, arg) < 0 ? c : a)) \
+ : (OUT_COMPAR(b, c, arg) > 0 ? \
+ b : (OUT_COMPAR(a, c, arg) < 0 ? a : c)); \
+ } \
+ \
+ void \
+ SPEC_VAR##_qsort_arg(void *a, size_t n, size_t es, void *arg) \
+ { \
+ char *pa, \
+ *pb, \
+ *pc, \
+ *pd, \
+ *pl, \
+ *pm, \
+ *pn; \
+ int d, \
+ r, \
+ swaptype, \
+ presorted; \
+ \
+ loop:SWAPINIT(a, es); \
+ if (n < 7) \
+ { \
+ for (pm = (char *) a + es; pm < (char *) a + n * es; pm += es) \
+ for (pl = pm; pl > (char *) a && \
+ OUT_COMPAR(pl - es, pl, arg) > 0; \
+ pl -= es) \
+ thistype_swap(pl, pl - es); \
+ return; \
+ } \
+ presorted = 1; \
+ for (pm = (char *) a + es; pm < (char *) a + n * es; pm += es) \
+ { \
+ if (OUT_COMPAR(pm - es, pm, arg) > 0) \
+ { \
+ presorted = 0; \
+ break; \
+ } \
+ } \
+ if (presorted) \
+ return; \
+ pm = (char *) a + (n / 2) * es; \
+ if (n > 7) \
+ { \
+ pl = (char *) a; \
+ pn = (char *) a + (n - 1) * es; \
+ if (n > 40) \
+ { \
+ d = (n / 8) * es; \
+ pl = SPEC_VAR##med3(pl, pl + d, pl + 2 * d, arg); \
+ pm = SPEC_VAR##med3(pm - d, pm, pm + d, arg); \
+ pn = SPEC_VAR##med3(pn - 2 * d, pn - d, pn, arg); \
+ } \
+ pm = SPEC_VAR##med3(pl, pm, pn, arg); \
+ } \
+ thistype_swap(a, pm); \
+ pa = pb = (char *) a + es; \
+ pc = pd = (char *) a + (n - 1) * es; \
+ for (;;) \
+ { \
+ while (pb <= pc && \
+ (r = OUT_COMPAR(pb, a, arg)) <= 0) \
+ { \
+ if (r == 0) \
+ { \
+ thistype_swap(pa, pb); \
+ pa += es; \
+ } \
+ pb += es; \
+ } \
+ while (pb <= pc && \
+ (r = OUT_COMPAR(pc, a, arg)) >= 0) \
+ { \
+ if (r == 0) \
+ { \
+ thistype_swap(pc, pd); \
+ pd -= es; \
+ } \
+ pc -= es; \
+ } \
+ if (pb > pc) \
+ break; \
+ thistype_swap(pb, pc); \
+ pb += es; \
+ pc -= es; \
+ } \
+ pn = (char *) a + n * es; \
+ r = Min(pa - (char *) a, pb - pa); \
+ thistype_vecswap(a, pb - r, r); \
+ r = Min(pd - pc, pn - pd - es); \
+ thistype_vecswap(pb, pn - r, r); \
+ if ((r = pb - pa) > es) \
+ SPEC_VAR##_qsort_arg(a, r / es, es, arg); \
+ if ((r = pd - pc) > es) \
+ { \
+ /* Iterate rather than recurse to save stack space */ \
+ a = pn - r; \
+ n = r / es; \
+ goto loop; \
+ } \
+ }
+
+ /*
+ * Additional code, used depending on the specialization.
+ */
+ #define MULTI_ADDITIONAL_CODE \
+ { \
+ Assert(state->nKeys > 1); \
+ return state->comparetup(aT, bT, state); \
+ }
+
+ #define ONE_ADDITIONAL_CODE Assert(state->nKeys == 1);
diff --git a/src/port/qsort.c b/src/port/qsort.c
new file mode 100644
index 8e2c6d9..d1981d6
*** a/src/port/qsort.c
--- b/src/port/qsort.c
***************
*** 7,13 ****
* Remove ill-considered "swap_cnt" switch to insertion sort,
* in favor of a simple check for presorted input.
*
! * CAUTION: if you change this file, see also qsort_arg.c
*
* src/port/qsort.c
*/
--- 7,14 ----
* Remove ill-considered "swap_cnt" switch to insertion sort,
* in favor of a simple check for presorted input.
*
! * CAUTION: if you change this file, see also qsort_arg.c and
! * template_qsort_arg.h
*
* src/port/qsort.c
*/
diff --git a/src/port/qsort_arg.c b/src/port/qsort_arg.c
new file mode 100644
index 28d1894..0ab6198
*** a/src/port/qsort_arg.c
--- b/src/port/qsort_arg.c
***************
*** 7,13 ****
* Remove ill-considered "swap_cnt" switch to insertion sort,
* in favor of a simple check for presorted input.
*
! * CAUTION: if you change this file, see also qsort.c
*
* src/port/qsort_arg.c
*/
--- 7,13 ----
* Remove ill-considered "swap_cnt" switch to insertion sort,
* in favor of a simple check for presorted input.
*
! * CAUTION: if you change this file, see also qsort.c and template_qsort_arg.h
*
* src/port/qsort_arg.c
*/
On Fri, Feb 10, 2012 at 10:30 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
[ new patch ]
I spent quite a bit of time looking at this today - the patch
specifically, and the issue of making quicksort fast more generally.
It seemed to me that if we're going to have separate copies of the
quicksort code for tuple sorting, we might as well go whole hog and
specialize those copies to the particular needs of tuplesort.c as much
as possible. Accordingly, I whacked the code around so that it knows
that it is sorting SortTuple objects and need not conditionalize at
runtime on the size of the objects being swapped. You suggested
upthread that this might be worthwhile, and it seems that it is, so I
think we should do it.
Your patch removes the CHECK_FOR_INTERRUPTS() call from
comparetup_heap, which is no good. However, since I'd already decided
to specialize the copies of quicksort intended for sorting as much as
possible, it made sense to me to refactor things so that the qsort
routine itself, rather than the comparator, is responsible for calling
CHECK_FOR_INTERRUPTS(). This slightly reduces the number of times we
CHECK_FOR_INTERRUPTS(), but never allows more than a few comparisons
before doing it.
I find that your pg_always_inline macro is equivalent to just plain
inline on my system (MacOS X v10.6.8, gcc 4.2.1). It seems to need
something like this:
+#elif __GNUC__
+#define pg_always_inline inline __attribute__((always_inline))
...but I'm not very happy about relying on that, because I don't know
that it will work on every gcc version (never mind non-gcc compilers),
and I'm not convinced it's actually improving performance even on this
one. The documentation seems to indicate that this is intended to
force inlining even when not optimizing, which may have something to
do with the lack of effect: that's not really the point here anyway.
What I did instead is to replace template_qsort_arg.h with a script
called gen_qsort_tuple.pl, which generates a file called qsort_tuple.c
that tuplesort.c then #includes. This seems more flexible to me than
the macro-based approach. In particular, it allows me to generate
versions of qsort with different call signatures. The attached patch
generates two:
static void qsort_tuple(SortTuple *a, size_t n, SortTupleComparator
cmp_tuple, Tuplesortstate *state);
static void qsort_ssup(SortTuple *a, size_t n, SortSupport ssup);
The first of these is a drop-in replacement for qsort_arg() - any
tuplesort can use it, not just heap sorts. But it is faster than
qsort_arg() because of the specializations for the SortTuple data
type. The second handles the special case where we are sorting by a
single key that has an associated SortSupport object. In this case we
don't need to carry the overhead of passing around the Tuplesortstate
and dereferencing it, nor do we need the SortTupleComparator: we can
just pass the SortSupport itself. Maybe there's a way to get this
effect using macros, but I couldn't figure it out. At any rate, at
least for the single-key case, this approach effectively forces the
comparator to be inlined without requiring pg_always_inline.
With this patch, I get the following results, as compared with your
2012-02-10 version and master, using the same test cases I tested
before.
select * from nodups order by g offset 10001;
tps on master: 289.471274, 289.967984, 289.595958
tps on 2012-02-10 version: 359.150280, 356.284723, 356.888900
tps on attached version: 388.212793, 386.085083, 386.867478
select * from twocol order by a, b offset 10000;
tps on master: 261.676611, 260.440886, 259.529362
tps on 2012-02-10 version: 283.941312, 279.981723, 283.140208
tps on attached version: 283.146463, 278.344827, 280.727798
select * from f8 order by g offset 10000;
tps on master: 228.299924, 222.650355, 227.408506
tps on 2012-02-10 version: 260.289273, 257.181035, 256.377456
tps on attached version: 276.985299, 275.341575, 274.428095
There's some variation (which I can't account for) between the results
on master now and the results on master before - possibly just code
shifting around between cache lines due to unrelated changes, or maybe
some inadvertent change in my test setup. But it looks to me like
your 2012-02-10 version, without any type-specific optimizations, does
pretty much just as well on multi-key sorting as your previous
version, which had them - or if there is a difference, it's pretty
small.
Overall, I think the numbers for the version I'm attaching here look
pretty good: the single-key performance is clearly better than your
last version, and the multi-key performance is very slightly worse. I
think that slight worsening is a good trade-off, though, because this
version can use qsort_tuple() for all kinds of tuplesorts, not just
heap tuplesorts. Still, it seems like we ought to be able to do even
better: the multi-key specialization that you had in your patch can be
coded in this framework, too, and in theory those are independent of
the swapcode improvements. I tried coding up a multi-key
specialization which I believe to be quite similar to what you did,
but it didn't seem to do much. I'm attaching it here; maybe there's
some way to improve it (or a different test case where it pays off).
It strikes me that if we wanted to take this further, we could look at
squeezing out ApplySortComparator. For example, suppose that, upon
discovering that we can do an in-memory quicksort on a single sort
key, we make an initial pass over the data where we check whether it's
sorted and, as we go, swap all the entries with isnull1 = true to the
end of the memtuples array. We then sort the isnull1 = true entries
with the standard comparator, and the isnull1 = false entries with an
optimized comparator that elides most of ApplySortComparator and
instead just calls the comparison function directly. We then decide
on whether to emit the isnull1 = true entries first or last based on
NULLS FIRST/LAST, and decide whether to emit the remaining entries in
forward or reverse order based on ASC/DESC. Or maybe not exactly that
thing, but something like that, so that we sort the null and non-null
entries separately. The additional complexity in the read-out logic
would probably be more than offset by being able to use a simpler
comparator.
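Roughly the shape I have in mind, as a standalone toy rather than anything
resembling tuplesort.c (the struct and the comparator are simplified
stand-ins for SortTuple and ApplySortComparator, and the names are invented
for the example):

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

typedef struct
{
	long		datum1;			/* stand-in for the leading-key Datum */
	bool		isnull1;
} FakeSortTuple;

/* comparator that can assume neither input is NULL */
static int
cmp_notnull(const void *a, const void *b)
{
	const FakeSortTuple *ta = a;
	const FakeSortTuple *tb = b;

	if (ta->datum1 < tb->datum1)
		return -1;
	if (ta->datum1 > tb->datum1)
		return 1;
	return 0;
}

int
main(void)
{
	FakeSortTuple tuples[] = {
		{42, false}, {0, true}, {7, false}, {0, true}, {19, false}
	};
	size_t		n = sizeof(tuples) / sizeof(tuples[0]);
	size_t		i,
				notnull = 0;

	/* one pass: swap the isnull1 = false entries into the leading prefix */
	for (i = 0; i < n; i++)
	{
		if (!tuples[i].isnull1)
		{
			FakeSortTuple tmp = tuples[notnull];

			tuples[notnull] = tuples[i];
			tuples[i] = tmp;
			notnull++;
		}
	}

	/* sort only the non-null prefix with the null-free comparator */
	qsort(tuples, notnull, sizeof(FakeSortTuple), cmp_notnull);

	/* the nulls at the end would be emitted first or last per NULLS FIRST/LAST */
	for (i = 0; i < n; i++)
		printf("%ld%s\n", tuples[i].datum1,
			   tuples[i].isnull1 ? " (null)" : "");
	return 0;
}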
What do you think of this version?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
fastpath_sort_2012_02_14.patch (application/octet-stream):
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 0c763dd..01bb6e1 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -202,6 +202,7 @@ distprep:
$(MAKE) -C replication repl_gram.c repl_scanner.c
$(MAKE) -C utils fmgrtab.c fmgroids.h errcodes.h
$(MAKE) -C utils/misc guc-file.c
+ $(MAKE) -C utils/sort qsort_tuple.c
##########################################################################
@@ -315,7 +316,8 @@ maintainer-clean: distclean
utils/fmgroids.h \
utils/fmgrtab.c \
utils/errcodes.h \
- utils/misc/guc-file.c
+ utils/misc/guc-file.c \
+ utils/sort/qsort_tuple.c
##########################################################################
diff --git a/src/backend/utils/sort/.gitignore b/src/backend/utils/sort/.gitignore
new file mode 100644
index 0000000..f295863
--- /dev/null
+++ b/src/backend/utils/sort/.gitignore
@@ -0,0 +1 @@
+/qsort_tuple.c
diff --git a/src/backend/utils/sort/Makefile b/src/backend/utils/sort/Makefile
index 2ef4965..f46ce41 100644
--- a/src/backend/utils/sort/Makefile
+++ b/src/backend/utils/sort/Makefile
@@ -14,4 +14,12 @@ include $(top_builddir)/src/Makefile.global
OBJS = logtape.o sortsupport.o tuplesort.o tuplestore.o
+tuplesort.o: qsort_tuple.c
+
+qsort_tuple.c: gen_qsort_tuple.pl
+ $(PERL) $(srcdir)/gen_qsort_tuple.pl $< > $@
+
include $(top_srcdir)/src/backend/common.mk
+
+maintainer-clean:
+ rm -f qsort_tuple.c
diff --git a/src/backend/utils/sort/gen_qsort_tuple.pl b/src/backend/utils/sort/gen_qsort_tuple.pl
new file mode 100755
index 0000000..40d5548
--- /dev/null
+++ b/src/backend/utils/sort/gen_qsort_tuple.pl
@@ -0,0 +1,232 @@
+#!/usr/bin/perl -w
+
+#
+# gen_qsort_tuple.pl
+#
+# This script generates specialized versions of the quicksort algorithm for
+# tuple sorting. The quicksort code is derived from the NetBSD code. The
+# code generated by this script runs significantly faster than vanilla qsort
+# when used to sort tuples. This speedup comes from a number of places.
+# The major effects are (1) inlining simple tuple comparators is much faster
+# than jumping through a function pointer and (2) swap and vecswap operations
+# specialized to the particular data type of interest (in this case, SortTuple)
+# are faster than the generic routines.
+#
+# Modifications from vanilla NetBSD source:
+# Add do ... while() macro fix
+# Remove __inline, _DIAGASSERTs, __P
+# Remove ill-considered "swap_cnt" switch to insertion sort,
+# in favor of a simple check for presorted input.
+# Instead of sorting arbitrary objects, we're always sorting SortTuples
+# Add CHECK_FOR_INTERRUPTS()
+#
+# CAUTION: if you change this file, see also qsort.c and qsort_arg.c
+#
+
+use strict;
+
+my $SUFFIX;
+my $EXTRAARGS;
+my $EXTRAPARAMS;
+my $CMPPARAMS;
+
+emit_qsort_boilerplate();
+
+$SUFFIX = 'tuple';
+$EXTRAARGS = ', SortTupleComparator cmp_tuple, Tuplesortstate *state';
+$EXTRAPARAMS = ', cmp_tuple, state';
+$CMPPARAMS = ', state';
+emit_qsort_implementation();
+
+$SUFFIX = 'ssup';
+$EXTRAARGS = ', SortSupport ssup';
+$EXTRAPARAMS = ', ssup';
+$CMPPARAMS = ', ssup';
+print <<'EOM';
+#define cmp_ssup(a, b, ssup) \
+ ApplySortComparator((a)->datum1, (a)->isnull1, \
+ (b)->datum1, (b)->isnull1, ssup)
+EOM
+emit_qsort_implementation();
+
+sub emit_qsort_boilerplate
+{
+ print <<'EOM';
+/*
+ * autogenerated by src/backend/utils/sort/gen_qsort_tuple.pl, do not edit
+ * This file is included by tuplesort.c, rather than compiled separately.
+ */
+
+/* $NetBSD: qsort.c,v 1.13 2003/08/07 16:43:42 agc Exp $ */
+
+/*-
+ * Copyright (c) 1992, 1993
+ * The Regents of the University of California. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the University nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/*
+ * Qsort routine based on J. L. Bentley and M. D. McIlroy,
+ * "Engineering a sort function",
+ * Software--Practice and Experience 23 (1993) 1249-1265.
+ * We have modified their original by adding a check for already-sorted input,
+ * which seems to be a win per discussions on pgsql-hackers around 2006-03-21.
+ */
+
+static void
+swapfunc(SortTuple *a, SortTuple *b, size_t n)
+{
+ do
+ {
+ SortTuple t = *a;
+ *a++ = *b;
+ *b++ = t;
+ } while (--n > 0);
+}
+
+#define swap(a, b) \
+ do { \
+ SortTuple t = *(a); \
+ *(a) = *(b); \
+ *(b) = t; \
+ } while (0);
+
+#define vecswap(a, b, n) if ((n) > 0) swapfunc((a), (b), (size_t)(n))
+EOM
+}
+
+sub emit_qsort_implementation
+{
+ print <<EOM;
+static SortTuple *
+med3_$SUFFIX(SortTuple *a, SortTuple *b, SortTuple *c$EXTRAARGS)
+{
+ return cmp_$SUFFIX(a, b$CMPPARAMS) < 0 ?
+ (cmp_$SUFFIX(b, c$CMPPARAMS) < 0 ? b :
+ (cmp_$SUFFIX(a, c$CMPPARAMS) < 0 ? c : a))
+ : (cmp_$SUFFIX(b, c$CMPPARAMS) > 0 ? b :
+ (cmp_$SUFFIX(a, c$CMPPARAMS) < 0 ? a : c));
+}
+
+static void
+qsort_$SUFFIX(SortTuple *a, size_t n$EXTRAARGS)
+{
+ SortTuple *pa,
+ *pb,
+ *pc,
+ *pd,
+ *pl,
+ *pm,
+ *pn;
+ int d,
+ r,
+ presorted;
+
+loop:
+ CHECK_FOR_INTERRUPTS();
+ if (n < 7)
+ {
+ for (pm = a + 1; pm < a + n; pm++)
+ for (pl = pm; pl > a && cmp_$SUFFIX(pl - 1, pl$CMPPARAMS) > 0; pl--)
+ swap(pl, pl - 1);
+ return;
+ }
+ presorted = 1;
+ for (pm = a + 1; pm < a + n; pm++)
+ {
+ CHECK_FOR_INTERRUPTS();
+ if (cmp_$SUFFIX(pm - 1, pm$CMPPARAMS) > 0)
+ {
+ presorted = 0;
+ break;
+ }
+ }
+ if (presorted)
+ return;
+ pm = a + (n / 2);
+ if (n > 7)
+ {
+ pl = a;
+ pn = a + (n - 1);
+ if (n > 40)
+ {
+ d = (n / 8);
+ pl = med3_$SUFFIX(pl, pl + d, pl + 2 * d$EXTRAPARAMS);
+ pm = med3_$SUFFIX(pm - d, pm, pm + d$EXTRAPARAMS);
+ pn = med3_$SUFFIX(pn - 2 * d, pn - d, pn$EXTRAPARAMS);
+ }
+ pm = med3_$SUFFIX(pl, pm, pn$EXTRAPARAMS);
+ }
+ swap(a, pm);
+ pa = pb = a + 1;
+ pc = pd = a + (n - 1);
+ for (;;)
+ {
+ while (pb <= pc && (r = cmp_$SUFFIX(pb, a$CMPPARAMS)) <= 0)
+ {
+ CHECK_FOR_INTERRUPTS();
+ if (r == 0)
+ {
+ swap(pa, pb);
+ pa++;
+ }
+ pb++;
+ }
+ while (pb <= pc && (r = cmp_$SUFFIX(pc, a$CMPPARAMS)) >= 0)
+ {
+ CHECK_FOR_INTERRUPTS();
+ if (r == 0)
+ {
+ swap(pc, pd);
+ pd--;
+ }
+ pc--;
+ }
+ if (pb > pc)
+ break;
+ swap(pb, pc);
+ pb++;
+ pc--;
+ }
+ pn = a + n;
+ r = Min(pa - a, pb - pa);
+ vecswap(a, pb - r, r);
+ r = Min(pd - pc, pn - pd - 1);
+ vecswap(pb, pn - r, r);
+ if ((r = pb - pa) > 1)
+ qsort_$SUFFIX(a, r$EXTRAPARAMS);
+ if ((r = pd - pc) > 1)
+ {
+ /* Iterate rather than recurse to save stack space */
+ a = pn - r;
+ n = r;
+ goto loop;
+ }
+/* qsort_$SUFFIX(pn - r, r$EXTRAPARAMS);*/
+}
+
+EOM
+}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 1452e8c..10b19c0 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -195,6 +195,9 @@ typedef enum
#define TAPE_BUFFER_OVERHEAD (BLCKSZ * 3)
#define MERGE_BUFFER_SIZE (BLCKSZ * 32)
+typedef int (*SortTupleComparator) (const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state);
+
/*
* Private state of a Tuplesort operation.
*/
@@ -223,8 +226,7 @@ struct Tuplesortstate
* <0, 0, >0 according as a<b, a=b, a>b. The API must match
* qsort_arg_comparator.
*/
- int (*comparetup) (const SortTuple *a, const SortTuple *b,
- Tuplesortstate *state);
+ SortTupleComparator comparetup;
/*
* Function to copy a supplied input tuple into palloc'd space and set up
@@ -363,12 +365,14 @@ struct Tuplesortstate
/* These are specific to the index_hash subcase: */
uint32 hash_mask; /* mask for sortable part of hash code */
+ /* This is initialized when, and only when, there's just one key. */
+ SortSupport onlyKey;
+
/*
* These variables are specific to the Datum case; they are set by
* tuplesort_begin_datum and used only by the DatumTuple routines.
*/
Oid datumType;
- SortSupport datumKey;
/* we need typelen and byval in order to know how to copy the Datums. */
int datumTypeLen;
bool datumTypeByVal;
@@ -492,6 +496,11 @@ static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
static void reversedirection_datum(Tuplesortstate *state);
static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
+/*
+ * Special version of qsort, just for SortTuple objects.
+ */
+#include "qsort_tuple.c"
+
/*
* tuplesort_begin_xxx
@@ -631,6 +640,9 @@ tuplesort_begin_heap(TupleDesc tupDesc,
PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
}
+ if (nkeys == 1)
+ state->onlyKey = state->sortKeys;
+
MemoryContextSwitchTo(oldcontext);
return state;
@@ -809,13 +821,13 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
state->datumType = datumType;
/* Prepare SortSupport data */
- state->datumKey = (SortSupport) palloc0(sizeof(SortSupportData));
+ state->onlyKey = (SortSupport) palloc0(sizeof(SortSupportData));
- state->datumKey->ssup_cxt = CurrentMemoryContext;
- state->datumKey->ssup_collation = sortCollation;
- state->datumKey->ssup_nulls_first = nullsFirstFlag;
+ state->onlyKey->ssup_cxt = CurrentMemoryContext;
+ state->onlyKey->ssup_collation = sortCollation;
+ state->onlyKey->ssup_nulls_first = nullsFirstFlag;
- PrepareSortSupportFromOrderingOp(sortOperator, state->datumKey);
+ PrepareSortSupportFromOrderingOp(sortOperator, state->onlyKey);
/* lookup necessary attributes of the datum type */
get_typlenbyval(datumType, &typlen, &typbyval);
@@ -1222,11 +1234,16 @@ tuplesort_performsort(Tuplesortstate *state)
* amount of memory. Just qsort 'em and we're done.
*/
if (state->memtupcount > 1)
- qsort_arg((void *) state->memtuples,
- state->memtupcount,
- sizeof(SortTuple),
- (qsort_arg_comparator) state->comparetup,
- (void *) state);
+ {
+ if (state->onlyKey != NULL)
+ qsort_ssup(state->memtuples, state->memtupcount,
+ state->onlyKey);
+ else
+ qsort_tuple(state->memtuples,
+ state->memtupcount,
+ state->comparetup,
+ state);
+ }
state->current = 0;
state->eof_reached = false;
state->markpos_offset = 0;
@@ -2660,9 +2677,6 @@ comparetup_heap(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
int nkey;
int32 compare;
- /* Allow interrupting long sorts */
- CHECK_FOR_INTERRUPTS();
-
/* Compare the leading sort key */
compare = ApplySortComparator(a->datum1, a->isnull1,
b->datum1, b->isnull1,
@@ -2804,9 +2818,6 @@ comparetup_cluster(const SortTuple *a, const SortTuple *b,
int nkey;
int32 compare;
- /* Allow interrupting long sorts */
- CHECK_FOR_INTERRUPTS();
-
/* Compare the leading sort key, if it's simple */
if (state->indexInfo->ii_KeyAttrNumbers[0] != 0)
{
@@ -2995,9 +3006,6 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
int nkey;
int32 compare;
- /* Allow interrupting long sorts */
- CHECK_FOR_INTERRUPTS();
-
/* Compare the leading sort key */
compare = inlineApplySortFunction(&scanKey->sk_func, scanKey->sk_flags,
scanKey->sk_collation,
@@ -3102,9 +3110,6 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
IndexTuple tuple1;
IndexTuple tuple2;
- /* Allow interrupting long sorts */
- CHECK_FOR_INTERRUPTS();
-
/*
* Fetch hash keys and mask off bits we don't want to sort by. We know
* that the first column of the index tuple is the hash key.
@@ -3231,12 +3236,9 @@ reversedirection_index_hash(Tuplesortstate *state)
static int
comparetup_datum(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
{
- /* Allow interrupting long sorts */
- CHECK_FOR_INTERRUPTS();
-
- return ApplySortComparator(a->datum1, a->isnull1,
- b->datum1, b->isnull1,
- state->datumKey);
+ /* Not currently needed */
+ elog(ERROR, "comparetup_datum() should not be called");
+ return 0;
}
static void
@@ -3328,8 +3330,8 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
static void
reversedirection_datum(Tuplesortstate *state)
{
- state->datumKey->ssup_reverse = !state->datumKey->ssup_reverse;
- state->datumKey->ssup_nulls_first = !state->datumKey->ssup_nulls_first;
+ state->onlyKey->ssup_reverse = !state->onlyKey->ssup_reverse;
+ state->onlyKey->ssup_nulls_first = !state->onlyKey->ssup_nulls_first;
}
/*
diff --git a/src/port/qsort.c b/src/port/qsort.c
index 8e2c6d9..49d8fa7 100644
--- a/src/port/qsort.c
+++ b/src/port/qsort.c
@@ -7,7 +7,7 @@
* Remove ill-considered "swap_cnt" switch to insertion sort,
* in favor of a simple check for presorted input.
*
- * CAUTION: if you change this file, see also qsort_arg.c
+ * CAUTION: if you change this file, see also qsort_arg.c, gen_qsort_tuple.pl
*
* src/port/qsort.c
*/
diff --git a/src/port/qsort_arg.c b/src/port/qsort_arg.c
index 28d1894..3091eb0 100644
--- a/src/port/qsort_arg.c
+++ b/src/port/qsort_arg.c
@@ -7,7 +7,7 @@
* Remove ill-considered "swap_cnt" switch to insertion sort,
* in favor of a simple check for presorted input.
*
- * CAUTION: if you change this file, see also qsort.c
+ * CAUTION: if you change this file, see also qsort.c, gen_qsort_tuple.pl
*
* src/port/qsort_arg.c
*/
multi-key.patch (application/octet-stream):
commit 4a4fb12548eb2533f8804a691daad4bfb53302f4
Author: Robert Haas <rhaas@postgresql.org>
Date: Tue Feb 14 23:45:06 2012 -0500
Revert "Revert "Multi-key specializations.""
This reverts commit 6319911c04ab827449aab0c68cb7d5966825746e.
diff --git a/src/backend/utils/sort/gen_qsort_tuple.pl b/src/backend/utils/sort/gen_qsort_tuple.pl
index 40d5548..a31a3e2 100755
--- a/src/backend/utils/sort/gen_qsort_tuple.pl
+++ b/src/backend/utils/sort/gen_qsort_tuple.pl
@@ -49,6 +49,25 @@ print <<'EOM';
EOM
emit_qsort_implementation();
+$SUFFIX = 'heap';
+$EXTRAARGS = ', Tuplesortstate *state';
+$EXTRAPARAMS = ', state';
+$CMPPARAMS = ', state';
+print <<'EOM';
+static pg_always_inline int
+cmp_heap(SortTuple *a, SortTuple *b, Tuplesortstate *state)
+{
+ int compare;
+
+ compare = ApplySortComparator(a->datum1, a->isnull1,
+ b->datum1, b->isnull1, state->sortKeys);
+ if (compare == 0)
+ compare = state->comparetup_rest(a, b, state);
+ return compare;
+}
+EOM
+emit_qsort_implementation();
+
sub emit_qsort_boilerplate
{
print <<'EOM';
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 10b19c0..cdaa230 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -229,6 +229,12 @@ struct Tuplesortstate
SortTupleComparator comparetup;
/*
+ * Function to compare the non-leading columns of two tuples, as per
+ * above.
+ */
+ SortTupleComparator comparetup_rest;
+
+ /*
* Function to copy a supplied input tuple into palloc'd space and set up
* its SortTuple representation (ie, set tuple/datum1/isnull1). Also,
* state->availMem must be decreased by the amount of space used for the
@@ -462,6 +468,8 @@ static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
static void markrunend(Tuplesortstate *state, int tapenum);
static int comparetup_heap(const SortTuple *a, const SortTuple *b,
Tuplesortstate *state);
+static int comparetup_heap_rest(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state);
static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
static void writetup_heap(Tuplesortstate *state, int tapenum,
SortTuple *stup);
@@ -615,6 +623,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
randomAccess);
state->comparetup = comparetup_heap;
+ state->comparetup_rest = comparetup_heap_rest;
state->copytup = copytup_heap;
state->writetup = writetup_heap;
state->readtup = readtup_heap;
@@ -1238,6 +1247,8 @@ tuplesort_performsort(Tuplesortstate *state)
if (state->onlyKey != NULL)
qsort_ssup(state->memtuples, state->memtupcount,
state->onlyKey);
+ else if (state->comparetup_rest != NULL)
+ qsort_heap(state->memtuples, state->memtupcount, state);
else
qsort_tuple(state->memtuples,
state->memtupcount,
@@ -2712,6 +2723,49 @@ comparetup_heap(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
return 0;
}
+static int
+comparetup_heap_rest(const SortTuple *a, const SortTuple *b,
+ Tuplesortstate *state)
+{
+ SortSupport sortKey = state->sortKeys;
+ HeapTupleData ltup;
+ HeapTupleData rtup;
+ TupleDesc tupDesc;
+ int nkey;
+ int32 compare = 0;
+
+ /*
+ * We don't have to compare the leading sort key, because qsort_heap will
+ * have already done that. Our job is to compare the remaining sort keys
+ * to break ties.
+ */
+ ltup.t_len = ((MinimalTuple) a->tuple)->t_len + MINIMAL_TUPLE_OFFSET;
+ ltup.t_data = (HeapTupleHeader) ((char *) a->tuple - MINIMAL_TUPLE_OFFSET);
+ rtup.t_len = ((MinimalTuple) b->tuple)->t_len + MINIMAL_TUPLE_OFFSET;
+ rtup.t_data = (HeapTupleHeader) ((char *) b->tuple - MINIMAL_TUPLE_OFFSET);
+ tupDesc = state->tupDesc;
+ sortKey++;
+ for (nkey = 1; nkey < state->nKeys; nkey++, sortKey++)
+ {
+ AttrNumber attno = sortKey->ssup_attno;
+ Datum datum1,
+ datum2;
+ bool isnull1,
+ isnull2;
+
+ datum1 = heap_getattr(&ltup, attno, tupDesc, &isnull1);
+ datum2 = heap_getattr(&rtup, attno, tupDesc, &isnull2);
+
+ compare = ApplySortComparator(datum1, isnull1,
+ datum2, isnull2,
+ sortKey);
+ if (compare != 0)
+ return compare;
+ }
+
+ return 0;
+}
+
static void
copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup)
{
diff --git a/src/include/c.h b/src/include/c.h
index 7396adb..ecb1a60 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -850,4 +850,18 @@ extern int fdatasync(int fildes);
/* /port compatibility functions */
#include "port.h"
+/*
+ * Define a cross-platform "always-inline" macro. This is a very sharp tool that
+ * should be used judiciously.
+ */
+#ifdef __always_inline
+#define pg_always_inline __always_inline
+#elif defined(__force_inline)
+#define pg_always_inline __force_inline
+#elif __GNUC__
+#define pg_always_inline inline __attribute__((always_inline))
+#else
+#define pg_always_inline inline
+#endif
+
#endif /* C_H */
On 15 February 2012 06:16, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Feb 10, 2012 at 10:30 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
[ new patch ]
I spent quite a bit of time looking at this today - the patch
specifically, and the issue of making quicksort fast more generally.
It seemed to me that if we're going to have separate copies of the
quicksort code for tuple sorting, we might as well go whole hog and
specialize those copies to the particular needs of tuplesort.c as much
as possible. Accordingly, I whacked the code around so that it knows
that it is sorting SortTuple objects and need not conditionalize at
runtime on the size of the objects being swapped. You suggested
upthread that this might be worthwhile, and it seems that it is, so I
think we should do it.
Cool. I agree that we should do this. It doesn't need to be justified
as a performance optimisation - it makes sense to refactor in this
way. If that makes things faster, then so much the better.
Your patch removes the CHECK_FOR_INTERRUPTS() call from
comparetup_heap, which is no good. However, since I'd already decided
to specialize the copies of quicksort intended for sorting as much as
possible, it made sense to me to refactor things so that the qsort
routine itself, rather than the comparator, is responsible for calling
CHECK_FOR_INTERRUPTS(). This slightly reduces the number of times we
CHECK_FOR_INTERRUPTS(), but never allows more than a few comparisons
before doing it.
Well, the idea there was to have one and only one CHECK_FOR_INTERRUPTS()
call, in the leading comparator. I should perhaps have stressed further
that that patch was only intended to establish the idea of an inlined
outer comparator that compares the leading key itself and makes the
function-pointer call only for the remaining keys.
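To make the pattern concrete, here is a standalone toy (plain C, invented
names, not tuplesort code) showing an outer comparator whose leading-key
comparison is visible to the compiler, with the function-pointer call
reserved for ties:

#include <stdio.h>
#include <string.h>

typedef struct
{
	int			key1;			/* leading key, compared inline */
	const char *key2;			/* tie-breaker, compared via pointer */
} Row;

typedef int (*TiebreakFn) (const Row *a, const Row *b);

static int
tiebreak_key2(const Row *a, const Row *b)
{
	return strcmp(a->key2, b->key2);
}

static inline int
compare_rows(const Row *a, const Row *b, TiebreakFn tiebreak)
{
	/* the leading-key comparison can be inlined at every call site */
	if (a->key1 != b->key1)
		return (a->key1 < b->key1) ? -1 : 1;
	/* only ties pay for the indirect call */
	return tiebreak(a, b);
}

int
main(void)
{
	Row			x = {1, "abc"};
	Row			y = {1, "abd"};

	printf("%d\n", compare_rows(&x, &y, tiebreak_key2));
	return 0;
}

An interrupt check placed at the top of compare_rows would then be hit
exactly once per comparison, which is what that patch was aiming at.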
I find that your pg_always_inline macro is equivalent to just plain
inline on my system (MacOS X v10.6.8, gcc 4.2.1). It seems to need
something like this:
+#elif __GNUC__
+#define pg_always_inline inline __attribute__((always_inline))
...but I'm not very happy about relying on that, because I don't know
that it will work on every gcc version (never mind non-gcc compilers),
and I'm not convinced it's actually improving performance even on this
one. The documentation seems to indicate that this is intended to
force inlining even when not optimizing, which may have something to
do with the lack of effect: that's not really the point here anyway.
There is a scant reference that suggests this, yes, but it is
contradicted by other, equally scant sources. The macro was observed to
be helpful in the multi-key case, and the effect was quite apparent and
reproducible. At any rate, its appearance in my most recent patch was
vestigial - I think there ought to be no need for it, now that the
compiler's cost/benefit analysis probably gets inlining right.
What I did instead is to replace template_qsort_arg.h with a script
called gen_qsort_tuple.pl, which generates a file called qsort_tuple.c
that tuplesort.c then #includes. This seems more flexible to me than
the macro-based approach. In particular, it allows me to generate
versions of qsort with different call signatures. The attached patch
generates two:
static void qsort_tuple(SortTuple *a, size_t n, SortTupleComparator
cmp_tuple, Tuplesortstate *state);
static void qsort_ssup(SortTuple *a, size_t n, SortSupport ssup);
The first of these is a drop-in replacement for qsort_arg() - any
tuplesort can use it, not just heap sorts. But it is faster than
qsort_arg() because of the specializations for the SortTuple data
type. The second handles the special case where we are sorting by a
single key that has an associated SortSupport object. In this case we
don't need to carry the overhead of passing around the Tuplesortstate
and dereferencing it, nor do we need the SortTupleComparator: we can
just pass the SortSupport itself. Maybe there's a way to get this
effect using macros, but I couldn't figure it out.
I've seen the pattern where generic programming is aped with the
preprocessor in the style of my patch in, of all things, wxWidgets,
where it continues to be used as part of pgAdmin, or was when I worked
on wxWidgets 3 support exactly one year ago. One thing that is
particularly bizarre about that code is that clients actually #include
a cpp file. This is, I'd guess, a legacy of having to target pre-C++98
compilers without decent template support. MSVC 6, in particular, was
completely inadequate in this area.
I am inclined to agree that given that we already use Perl to generate
source code like this, it seems natural that we should prefer to do
that, if only to avoid paranoia about the inclusion of a dial-a-bloat
knob. I am at a disadvantage here, since I've never written a line of
Perl.
With this patch, I get the following results, as compared with your
2012-02-10 version and master, using the same test cases I tested
before.
select * from nodups order by g offset 10001;
tps on master: 289.471274, 289.967984, 289.595958
tps on 2012-02-10 version: 359.150280, 356.284723, 356.888900
tps on attached version: 388.212793, 386.085083, 386.867478
select * from twocol order by a, b offset 10000;
tps on master: 261.676611, 260.440886, 259.529362
tps on 2012-02-10 version: 283.941312, 279.981723, 283.140208
tps on attached version: 283.146463, 278.344827, 280.727798
select * from f8 order by g offset 10000;
tps on master: 228.299924, 222.650355, 227.408506
tps on 2012-02-10 version: 260.289273, 257.181035, 256.377456
tps on attached version: 276.985299, 275.341575, 274.428095
There's some variation (which I can't account for) between the results
on master now and the results on master before - possibly just code
shifting around between cache lines due to unrelated changes, or maybe
some inadvertent change in my test setup. But it looks to me like
your 2012-02-10 version, without any type-specific optimizations, does
pretty much just as well on multi-key sorting as your previous
version, which had them - or if there is a difference, it's pretty
small.
I'd say that that's a fair assessment. It does just as well in that
case, but with far fewer additional instructions.
Overall, I think the numbers for the version I'm attaching here look
pretty good: the single-key performance is clearly better than your
last version, and the multi-key performance is very slightly worse. I
think that slight worsening is a good trade-off, though, because this
version can use qsort_tuple() for all kinds of tuplesorts, not just
heap tuplesorts. Still, it seems like we ought to be able to do even
better: the multi-key specialization that you had in your patch can be
coded in this framework, too, and in theory those are independent of
the swapcode improvements. I tried coding up a multi-key
specialization which I believe to be quite similar to what you did,
but it didn't seem to do much. I'm attaching it here; maybe there's
some way to improve it (or a different test case where it pays off).
Hmm. I'll run some tests myself when I get a chance, and attempt to
determine what's going on here. I don't have immediate access to a
good benchmarking server, which would be preferable, and I'd like to
post a revision of pg_stat_statements normalisation today, as it's
been too long. Nice work though.
It strikes me that if we wanted to take this further, we could look at
squeezing out ApplySortComparator. For example, suppose that, upon
discovering that we can do an in-memory quicksort on a single sort
key, we make an initial pass over the data where we check whether it's
sorted and, as we go, swap all the entries with isnull1 = true to the
end of the memtuples array. We then sort the isnull1 = true entries
with the standard comparator, and the isnull1 = false entries with an
optimized comparator that elides most of ApplySortComparator and
instead just calls the comparison function directly. We then decide
on whether to emit the isnull1 = true entries first or last based on
NULLS FIRST/LAST, and decide whether to emit the remaining entries in
forward or reverse order based on ASC/DESC. Or maybe not exactly that
thing, but something like that, so that we sort the null and non-null
entries separately. The additional complexity in the read-out logic
would probably be more than offset by being able to use a simpler
comparator.
While it's really easy to be wrong about these things, I'd venture to
guess that branch prediction makes this a less than compelling win in
practice. That said, if you want to know how good a win it might be,
you need only knock out a quick-and-dirty prototype and see how fast a
sort is - how well it does will be pretty close to your best
case, but not quite, since presumably such a prototype wouldn't worry
about the applicability of the optimisation.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Wed, Feb 15, 2012 at 8:29 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
Cool. I agree that we should do this. It doesn't need to be justified
as a performance optimisation - it makes sense to refactor in this
way. If that makes things faster, then so much the better.
Well, maybe so, but I think if the performance had been the same I
might not have messed with it.
I am inclined to agree that given that we already use Perl to generate
source code like this, it seems natural that we should prefer to do
that, if only to avoid paranoia about the inclusion of a dial-a-bloat
knob. I am at a disadvantage here, since I've never written a line of
Perl.
I think it's still dial-a-bloat, but I feel pretty comfortable about
how we've got that knob adjusted in this version. It's almost as much
improvement as any previous version, it applies to more cases, and the
code footprint is the least of any version I've measured.
Hmm. I'll run some tests myself when I get a chance, and attempt to
determine what's going on here. I don't have immediate access to a
good benchmarking server, which would be preferable, and I'd like to
post a revision of pg_stat_statements normalisation today, as it's
been too long. Nice work though.
Thanks.
It strikes me that if we wanted to take this further, we could look at
squeezing out ApplySortComparator. For example, suppose that, upon
discovering that we can do an in-memory quicksort on a single sort
key, we make an initial pass over the data where we check whether it's
sorted and, as we go, swap all the entries with isnull1 = true to the
end of the memtuples array. We then sort the isnull1 = true entries
with the standard comparator, and the isnull1 = false entries with an
optimized comparator that elides most of ApplySortComparator and
instead just calls the comparison function directly. We then decide
on whether to emit the isnull1 = true entries first or last based on
NULLS FIRST/LAST, and decide whether to emit the remaining entries in
forward or reverse order based on ASC/DESC. Or maybe not exactly that
thing, but something like that, so that we sort the null and non-null
entries separately. The additional complexity in the read-out logic
would probably be more than offset by being able to use a simpler
comparator.
While it's really easy to be wrong about these things, I'd venture to
guess that branch prediction makes this a less than compelling win in
practice. That said, if you want to know how good a win it might be,
you need only knock out a quick-and-dirty prototype and see how fast a
sort is - how well it does will be pretty close to your best
case, but not quite, since presumably such a prototype wouldn't worry
about the applicability of the optimisation.
You might be right. But note that the swapcode improvements were also
vulnerable to branch prediction, too, though there could also be other
effects there. One effect that should perhaps be considered is the
cost of saving and restoring registers. Even if the branch itself
doesn't cost much because we know what the result will be, the
operands still have to be loaded into registers, which isn't free of
itself and also potentially forces those registers to be saved and
restored across function calls. Still, I'm inclined not to poke at
this right now: there are a lot of patches that still need to be
looked at, and at the rate we're currently disposing of patches this
CommitFest will still be ongoing come Easter.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 15 February 2012 15:27, Robert Haas <robertmhaas@gmail.com> wrote:
I am inclined to agree that given that we already use Perl to generate
source code like this, it seems natural that we should prefer to do
that, if only to avoid paranoia about the inclusion of a dial-a-bloat
knob. I am at a disadvantage here, since I've never written a line of
Perl.
I think it's still dial-a-bloat, but I feel pretty comfortable about
how we've got that knob adjusted in this version. It's almost as much
improvement as any previous version, it applies to more cases, and the
code footprint is the least of any version I've measured.
I'm happy that the principle that a dial-a-bloat knob isn't
necessarily a bad thing has been accepted, though that term is kind of
pejorative in that it implies that the knob necessarily adds bloat to
the binary.
I define bloat here as the addition of dead instructions to the
binary, or at least code that doesn't pull its weight. Clearly, that
isn't the case here, and I suspect that we will find that it isn't the
case in other places too.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services