Progress on fast path sorting, btree index creation time
I'll try and keep this terse. I've promised to justify the number of
specialisations that are in my fast-path sorting patch, and I may yet
conclude that a reduction is appropriate. Not today though - there are
quite a few ideas in the patch (even more now), and not all have been
exhaustively evaluated. Maybe that isn't what some had hoped for right
now, but deferring publicly and definitively answering those questions
to focus on tweaking the patch has brought additional benefits. I felt
that I went too long without checking in with the community though.
The purpose of this e-mail is to:
1. Report back a further improvement in the performance benefits seen
when sorting with multiple sortkeys. The performance benefits seen for
that case were previously relatively modest; they're better now.
2. Report improvements from applying these techniques to btree index
tuplesorting (short version: they're also quite good). The data used
is exactly the same as it was in my previous benchmark; orderlines is
1538MB, and has lots of duplicates. The environment is also identical.
3. Resolve two anticipated controversies that are, respectively,
somewhat orthogonal and completely orthogonal to the binary bloat
controversy. The first (controversy A) is that I have added a new
piece of infrastructure, pg_always_inline, which, as the name
suggests, is a portable way of insisting that a function should be
invariably inlined. Portable libraries like libc have been doing this
for a long time now, and I actually use the libc macro (that expands
to __attribute__((always_inline))) where possible. The second
possible point of contention (controversy B) is that I have jettisoned
various protections against bad qsort implementations that I believe
are a legacy of when we used the system qsort pre-2006, and that can
no longer be justified. For example, quick sort performing badly in the
face of lots of duplicates is a well understood problem today
(partitioning should stop on equal keys), and ISTM that protecting
against that outside the qsort implementation (but only for index
sorts) is wrong-headed.
The first possibly controversial adjustment (controversy A) is clearly
also the reason for (1) - that has been well isolated. I haven't got
around to quantifying the performance improvement seen due to the
second possibly controversial adjustment (controversy B), but I
believe that it can be justified as a case of effectively removing
redundant code. After all, we now assert against comparing a
tuple to itself anyway, and the "cheap insurance" that existed in
comparetup_index_btree was never present in any form in
comparetup_index_heap, and we heard no complaints, AFAIK.
Attached are:
* A detailing of libc's use of __always_inline /
__attribute__((always_inline)). Note that it often appears in code
that is built for all platforms.
* The WIP patch itself, rebased to integrate with the new SortSupport
infrastructure. I've gone a bit crazy with btree specialisations, but
I suspect that's where it matters least and makes most sense.
* A new Python script for benchmarking index creation, that is
similar to the other one I previously posted for benchmarking the
sorting of heap tuples.
* A spreadsheet that shows the results of re-running my earlier heap
tuple sorting benchmark with this new patch. The improvement in the
query that orders by 2 columns is all that is pertinent there, when
considering the value of (1) and the sense in standing still for
controversy A.
* A spreadsheet that shows the difference in index creation times,
generated with the help of the new python script.
Thoughts?
I had another idea when writing this patch that I haven't developed at
all but I'll share anyway. That is, it might be a good idea to use a
balanced quicksort:
http://xlinux.nist.gov/dads//HTML/balancedqsrt.html
It might be possible to get a reasonable approximation of the actual
median value of a given column with existing statistics, which could
be hinted to qsort_arg. This would do a better job of guessing an
appropriate initial pivot value for qsort than the med3 sampling
technique (what we do now, advocated by Sedgewick: use the median of
the first, middle and last elements of the current partition), though
we'd still use that med3 technique to select all other pivots, and
perhaps signal to qsort_arg "you're on your own, fall back to med3 for
the initial pivot" with a null ptr, according to some heuristic.
There's obviously a not inconsiderable impedance mismatch to resolve
if we're to do that though, so that Tuplesortstate has a pointer to
the median SortTuple.
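For reference, the med3 technique described above (take the median of
the first, middle and last elements of the partition) can be sketched
like so. This is an illustrative version only -- the names and the
comparator signature do not exactly match qsort_arg:

```c
#include <assert.h>

/* Simple three-way integer comparator, for demonstration purposes. */
static int
cmp_int(const void *p1, const void *p2)
{
    int         v1 = *(const int *) p1;
    int         v2 = *(const int *) p2;

    return (v1 > v2) - (v1 < v2);
}

/*
 * Classic median-of-three pivot selection: return whichever of the
 * three elements compares as the median, using the caller's comparator.
 * Illustrative sketch -- not the actual qsort_arg code.
 */
static char *
med3(char *a, char *b, char *c, int (*cmp) (const void *, const void *))
{
    return cmp(a, b) < 0
        ? (cmp(b, c) < 0 ? b : (cmp(a, c) < 0 ? c : a))
        : (cmp(b, c) > 0 ? b : (cmp(a, c) > 0 ? c : a));
}
```

The idea sketched in the paragraph above would only replace this
selection for the initial, top-level pivot; every recursive level would
still use med3 as shown.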
Can I get a straw poll on how much of a problem worst-case performance
of qsort is believed to be?
In a perfect world, if it were somehow possible to know the perfect
pivots ahead of time from a histogram or something, we'd have a quick
sort variant with worst-case performance of O(n log(n)). That, or the
limited approximation that I've sketched would perhaps be worthwhile,
even if it was subject to a number of caveats. Wikipedia claims that
the worst case for quick sort is O(n log(n)) with the refinements
recommended by Sedgewick's 1978 paper, but this seems like nonsense to
me - the med3 technique is a good heuristic in practice, but it's
perfectly possible in theory for its sampling to always get things
completely wrong (this is not an unfamiliar problem). How often it
messes up in the real world and how much it matters is something that
I wouldn't like to speculate on, though it is the main factor that
will determine if the idea is worth pursuing. I can tell you that I
have heard one person observe that it had an unpredictable runtime.
However, they might not have noticed if this patch was applied,
because in general the worse quicksort does the better this patch
does. Also, the fact that we use a median of "medians" when n > 40
(Dr. Sedgewick again, if I'm not mistaken) makes me a little skeptical
of that claim. Come to think of it, it might not be a bad idea to add
a bunch of comments to qsort_arg while I have this swapped into my
head, as it currently has no comments at all.
While I'm thinking out loud, here's another idea: Have a GUC which,
when enabled (perhaps potentially at various different granularities),
makes Timsort the in-memory sorting algorithm, as it now is by default
for Python, Java SE7 (for arrays) and the Android platform; for
certain applications, this could be a real win, and I don't think that
it would have to be invasive: it could be dropped in.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Attachments:
* tuplesort_btree_test.py (text/x-python)
* libc_always_inlined_use.txt (text/plain)
* fastpath_sort_btree_rev.patch (text/x-patch; +932, -148)
On Thu, Dec 29, 2011 at 8:03 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
* A spreadsheet that shows the results of re-running my earlier heap
tuple sorting benchmark with this new patch. The improvement in the
query that orders by 2 columns is all that is pertinent there, when
considering the value of (1) and the sense in standing still for
controversy A.

* A spreadsheet that shows the difference in index creation times,
generated with the help of the new python script.
very nice. let me save everyone the effort of opening his
spreadsheets (which by the way both show 'HEAD/unoptimized' --
probably not what you meant): he's showing a consistent ~50% reduction
in running time of sort driven queries -- that's money.
merlin
On 30 December 2011 19:46, Merlin Moncure <mmoncure@gmail.com> wrote:
On Thu, Dec 29, 2011 at 8:03 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
* A spreadsheet that shows the results of re-running my earlier heap
tuple sorting benchmark with this new patch. The improvement in the
query that orders by 2 columns is all that is pertinent there, when
considering the value of (1) and the sense in standing still for
controversy A.

* A spreadsheet that shows the difference in index creation times,
generated with the help of the new python script.

very nice. let me save everyone the effort of opening his
spreadsheets (which by the way both show 'HEAD/unoptimized' --
probably not what you meant): he's showing a consistent ~50% reduction
in running time of sort driven queries -- that's money.
Sorry, I think you may have misinterpreted the results, which is my
fault - I introduced a formatting error. In the case of the "btree"
spreadsheet, the first query on each sheet should be "create index
test on orderlines (prod_id);", and not "select * from orderlines
order by prod_id". The idea is to compare the results from each set of
binaries across pages of the spreadsheet (note that there are two
tabs). You should not compare anything between the two spreadsheets.
Revised btree results attached. The heap results that I posted do not
have any formatting errors, so they have not been revised.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Fri, Dec 30, 2011 at 2:30 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
On 30 December 2011 19:46, Merlin Moncure <mmoncure@gmail.com> wrote:
On Thu, Dec 29, 2011 at 8:03 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
* A spreadsheet that shows the results of re-running my earlier heap
tuple sorting benchmark with this new patch. The improvement in the
query that orders by 2 columns is all that is pertinent there, when
considering the value of (1) and the sense in standing still for
controversy A.

* A spreadsheet that shows the difference in index creation times,
generated with the help of the new python script.

very nice. let me save everyone the effort of opening his
spreadsheets (which by the way both show 'HEAD/unoptimized' --
probably not what you meant): he's showing a consistent ~50% reduction
in running time of sort driven queries -- that's money.

Sorry, I think you may have misinterpreted the results, which is my
fault - I introduced a formatting error. In the case of the "btree"
spreadsheet, the first query on each sheet should be "create index
test on orderlines (prod_id);", and not "select * from orderlines
order by prod_id". The idea is to compare the results from each set of
binaries across pages of the spreadsheet (note that there are two
tabs). You should not compare anything between the two spreadsheets.
Revised btree results attached. The heap results that I posted do not
have any formatting errors, so they have not been revised.
right-- my bad. still, that's 31-37% -- still pretty nice.
merlin
On Thu, Dec 29, 2011 at 9:03 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
3. Resolve two anticipated controversies that are, respectively,
somewhat orthogonal and completely orthogonal to the binary bloat
controversy. The first (controversy A) is that I have added a new
piece of infrastructure, pg_always_inline, which, as the name
suggests, is a portable way of insisting that a function should be
invariably inlined. Portable libraries like libc have been doing this
for a long time now, and I actually use the libc macro (that expands
to __attribute__((always_inline))) where possible.
I don't have a problem with the idea of a pg_always_inline, but I'm
wondering what sort of fallback mechanism you propose. It seems to me
that if the performance improvement is conditioned on inlining being done
and we're not confident that we can force the compiler to inline, we
ought to fall back to not making the specialization available, rather
than to making something available that may not work well. Of course
if it's only a question of a smaller performance gain, rather than an
actual loss, then that's not as much of an issue...
The second
possible point of contention (controversy B) is that I have jettisoned
various protections against bad qsort implementations that I believe
are a legacy of when we used the system qsort pre-2006, and that can no
longer be justified. For example, quick sort performing badly in the
face of lots of duplicates is a well understood problem today
(partitioning should stop on equal keys), and ISTM that protecting
against that outside the qsort implementation (but only for index
sorts) is wrong-headed.
Assuming you mean that qsort_arg() will protect against this stuff so
that callers don't need to do it separately, +1.
I had another idea when writing this patch that I haven't developed at
all but I'll share anyway. That is, it might be a good idea to use a
balanced quicksort:
I can't find any description of how this actually works... obviously,
we'd like to find a close-to-median element, but how do we actually do
that cheaply?
It might be possible to get a reasonable approximation of the actual
median value of a given column with existing statistics, which could
be hinted to qsort_arg.
I doubt that this would be worthwhile -- the pivot that we pick at the
toplevel doesn't really matter much; even if it's bad, we can recover
just fine if the pivot we pick one level down is good, or the level
after that. We only really have a problem if we keep systematically
picking bad pivots every time. And if we do have that problem,
improving the toplevel choice of pivot is only going to help modestly,
because we'll still be O(n^2) on each partition.
Can I get a straw poll on how much of a problem worst-case performance
of qsort is believed to be?
I'd be reluctant to remove any of the protections we currently have,
because I bet they were all put in to fix problems that people hit in
the field. But I don't know that we need more. Of course,
consolidating them into qsort() itself rather than its callers
probably makes sense, as noted above.
In a perfect world, if it were somehow possible to know the perfect
pivots ahead of time from a histogram or something, we'd have a quick
sort variant with worst-case performance of O(n log(n)). That, or the
limited approximation that I've sketched would perhaps be worthwhile,
even if it was subject to a number of caveats. Wikipedia claims that
the worst case for quick sort is O(n log(n)) with the refinements
recommended by Sedgewick's 1978 paper, but this seems like nonsense to
me - the med3 technique is a good heuristic in practice, but it's
perfectly possible in theory for its sampling to always get things
completely wrong (this is not an unfamiliar problem).
I agree. After reading
http://nanxkurniawan.wordpress.com/2010/05/31/quicksort/ I am inclined
to think that things are still O(n lg n) even if we always partition
so that any fixed percentage of the data is on one side of the
partition element and the remainder is on the other side. So for
example if all of our partition elements are greater than 1% of the
elements in their partition and less than the other 99%, we'll still
be O(n lg n), though with a considerably higher constant factor, I
think. However, if our partition elements are always a fixed NUMBER
of elements from the end - whether it's 1 or 100 - then we'll be
O(n^2), though again the constant factor will depend on how many
elements you knock off each time. In practice I'm not sure whether an
algorithmic analysis matters much, because in real life the constant
factors are probably pretty important, hence the median-of-medians
approach with n > 40.
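To spell out the recurrences behind that reasoning (my sketch of the
standard argument, not something from the thread):

```latex
\begin{align*}
\text{fixed fraction (e.g.\ 1\%/99\%):}\quad
  & T(n) = T(n/100) + T(99n/100) + cn \\
  & \Rightarrow\ \text{depth } \log_{100/99} n = O(\log n),\
    O(n)\ \text{work per level}\ \Rightarrow\ T(n) = O(n \log n) \\[4pt]
\text{fixed count (always $k$ off one end):}\quad
  & T(n) = T(k) + T(n-k) + cn \\
  & \Rightarrow\ \approx n/k\ \text{levels of}\ O(n)\ \text{work}
    \ \Rightarrow\ T(n) = O(n^2/k)
\end{align*}
```

So a pivot that is merely a bad fraction still gives O(n log n) with a
worse constant; only a pivot that is systematically a bounded distance
from the end degrades to quadratic.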
While I'm thinking out loud, here's another idea: Have a GUC which,
when enabled (perhaps potentially at various different granularities),
makes Timsort the in-memory sorting algorithm, as it now is by default
for Python, Java SE7 (for arrays) and the Android platform; for
certain applications, this could be a real win, and I don't think that
it would have to be invasive: it could be dropped in.
I gather from a quick read that this is supposed to be especially good
when the data is already somewhat sorted. Would there be any merit in
trying to guess when that's likely to be the case (e.g. based on the
correlation statistic)? That seems like a stretch, but OTOH a GUC
doesn't feel great either: what is the poor user to do with a query
that does 2 sorts, one of which is faster with QS and the other of
which is faster with Timsort? Ugh.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Thu, Dec 29, 2011 at 9:03 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
The first (controversy A) is that I have added a new
piece of infrastructure, pg_always_inline, which, as the name
suggests, is a portable way of insisting that a function should be
invariably inlined.
I don't have a problem with the idea of a pg_always_inline, but I'm
wondering what sort of fallback mechanism you propose.
There is no compiler anywhere that implements "always inline", unless
you are talking about a macro. "inline" is a hint and nothing more,
and if you think you can force it you are mistaken. So this controversy
is easily resolved: we do not need any such construct.
The real question is whether we should accept a patch that is a
performance loss when the compiler fails to inline some reasonably
simple function. I think that would depend on the actual numbers
involved, so we'd need to see data before making a decision.
The second
possible point of contention (controversy B) is that I have jettisoned
various protections against bad qsort implementations that I believe
are a legacy of when we used the system qsort pre-2006, and that can no
longer be justified.
No objection to that one, I think, as long as you've verified that our
implementation is in fact okay about these things.
regards, tom lane
On 5 January 2012 22:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
There is no compiler anywhere that implements "always inline", unless
you are talking about a macro. "inline" is a hint and nothing more,
and if you think you can force it you are mistaken. So this controversy
is easily resolved: we do not need any such construct.
I'm slightly puzzled by your remarks here. GCC documentation on this
is sparse (although, as I've demonstrated, I can get better numbers
using the always_inline attribute on GCC 4.3), but take a look at
this:
http://msdn.microsoft.com/en-us/library/z8y1yy88.aspx
While it is not strictly true to say that pg_always_inline could
inline every function under every set of circumstances, it's pretty
close to the truth. I do accept that this facility could quite easily
be abused if its use isn't carefully measured. I also accept that
C99/GNU C inline functions are generally just requests to the compiler
that may be ignored (and indeed the compiler may inline without being
asked to).
It's short sighted to see this as a case of inlining itself making a
big difference, so much as it making a big difference as an enabling
transformation.
The real question is whether we should accept a patch that is a
performance loss when the compiler fails to inline some reasonably
simple function. I think that would depend on the actual numbers
involved, so we'd need to see data before making a decision.
Who said anything about a performance loss? Since the raw improvement
to qsort_arg is so large, it's pretty difficult to imagine a
confluence of circumstances in which this results in a net loss. See
the figures at
http://archives.postgresql.org/pgsql-hackers/2011-11/msg01316.php,
for example.
The pg_always_inline idea is relatively recent. It just serves to
provide additional encouragement to the compiler to inline, insofar as
that is possible on the platform in question. The compiler's
cost/benefit analysis cannot possibly appreciate how hot a code path
qsort_arg is, because it has a set of generic heuristics that are
quite fallible, and are very probably quite conservative. We can
appreciate such things, though.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Jan 5, 2012 at 5:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
There is no compiler anywhere that implements "always inline", unless
you are talking about a macro. "inline" is a hint and nothing more,
and if you think you can force it you are mistaken. So this controversy
is easily resolved: we do not need any such construct.
That's basically in direct contradiction to the experimental evidence.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 5 January 2012 20:23, Robert Haas <robertmhaas@gmail.com> wrote:
I don't have a problem with the idea of a pg_always_inline, but I'm
wondering what sort of fallback mechanism you propose. It seems to me
that if the performance improvement is conditioned on inlining being done
and we're not confident that we can force the compiler to inline, we
ought to fall back to not making the specialization available, rather
than to making something available that may not work well. Of course
if it's only a question of a smaller performance gain, rather than an
actual loss, then that's not as much of an issue...
pg_always_inline is more or less a neat adjunct to what I've already
done. You can take it or leave it, but it would be irrational to
dismiss it out of hand, because it is demonstrably effective in this
case. Using non-portable things like function attributes is fairly
well precedented - we use __attribute__((format(*) )) frequently. We
don't need a fallback mechanism other than "#else #define
pg_always_inline inline #endif". I think these facilities are
available to well over 99% of our users, who use GCC, GCC-compatible
compilers, or MSVC builds.
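For illustration, the sort of definition being described might look
like this. This is a sketch only -- the GCC attribute and MSVC's
__forceinline are as documented by their respective vendors, but the
exact patch text may differ, and comparator_stub is a hypothetical use:

```c
/*
 * Sketch of a portable pg_always_inline, falling back to a plain
 * "inline" hint where no stronger facility is available.
 * Illustrative only; not the exact text of the patch.
 */
#if defined(__GNUC__)
#define pg_always_inline inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#define pg_always_inline __forceinline
#else
#define pg_always_inline inline
#endif

/* Hypothetical use in a hot comparator: */
static pg_always_inline int
comparator_stub(int a, int b)
{
    return (a > b) - (a < b);
}
```

On a compiler that takes the `#else` branch, the code still compiles and
runs correctly; it simply loses the stronger inlining encouragement.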
I have basically no sympathy for anyone who uses a compiler that can't
inline functions. Do such people actually exist?
ISTM that protecting
against that outside the qsort implementation (but only for index
sorts) is wrong-headed.

Assuming you mean that qsort_arg() will protect against this stuff so
that callers don't need to do it separately, +1.
That's exactly what I mean. Going quadratic in the face of lots of
duplicates is something that, remarkably, was a problem well into the
1990s, apparently because some guy noticed it one day, though I think
that lots of popular implementations happened to be unaffected anyway.
As you know, all queries tested have lots and lots of duplicates (a
~1.5GB table that contains the same number of distinct elements as a
~10MB table once did), and removing the "duplicate protection" for
btrees, on top of everything else, sees the qsort do an awful lot
better than HEAD does with the pseudo-protection. As I've said, we
already lack any such "protection" for heap tuples. We are unaffected
now anyway, in particular, because we do this, as apparently
recommended by both Sedgewick and Knuth:
while (pb <= pc && (r = cmp(pb, a)) <= 0)
{
if (r == 0)
{
swap(pa, pb);
pa += es;
}
pb += es;
}
Note that the partition swap occurs *because* the pivot and element
are equal. You might imagine that this is superfluous. It turns out
that it protects against the duplicates, resulting in sub-partitions
about the same size (though it occurs to me that it does so at the
expense of precluding parallelism). In short, Sedgewick would approve
of our qsort implementation.
As for the "compare a tuple to itself" thing, that's just silly, and
was probably only ever seen in some homespun implementation at some
point a relatively long time ago, but I've asserted against it anyway.
I can't find any description of how this actually works... obviously,
we'd like to find a close-to-median element, but how do we actually do
that cheaply?
Uh, it works whatever way you want it to work. The implementation has
to figure out how to get the median ahead of time.
I doubt that this would be worthwhile -- the pivot that we pick at the
toplevel doesn't really matter much; even if it's bad, we can recover
just fine if the pivot we pick one level down is good, or the level
after that. We only really have a problem if we keep systematically
picking bad pivots every time. And if we do have that problem,
improving the toplevel choice of pivot is only going to help modestly,
because we'll still be O(n^2) on each partition.

Can I get a straw poll on how much of a problem worst-case performance
of qsort is believed to be?

I'd be reluctant to remove any of the protections we currently have,
because I bet they were all put in to fix problems that people hit in
the field. But I don't know that we need more. Of course,
consolidating them into qsort() itself rather than its callers
probably makes sense, as noted above.
I was merely proposing something that would complement the med3 method
and our existing protections.
In a perfect world, if it were somehow possible to know the perfect
pivots ahead of time from a histogram or something, we'd have a quick
sort variant with worst-case performance of O(n log(n)). That, or the
limited approximation that I've sketched would perhaps be worthwhile,
even if it was subject to a number of caveats. Wikipedia claims that
the worst case for quick sort is O(n log(n)) with the refinements
recommended by Sedgewick's 1978 paper, but this seems like nonsense to
me - the med3 technique is a good heuristic in practice, but it's
perfectly possible in theory for its sampling to always get things
completely wrong (this is not an unfamiliar problem).

I agree. After reading
http://nanxkurniawan.wordpress.com/2010/05/31/quicksort/ I am inclined
to think that things are still O(n lg n) even if we always partition
so that any fixed percentage of the data is on one side of the
partition element and the remainder is on the other side. So for
example if all of our partition elements are greater than 1% of the
elements in their partition and less than the other 99%, we'll still
be O(n lg n), though with a considerably higher constant factor, I
think. However, if our partition elements are always a fixed NUMBER
of elements from the end - whether it's 1 or 100 - then we'll be
O(n^2), though again the constant factor will depend on how many
elements you knock off each time. In practice I'm not sure whether an
algorithmic analysis matters much, because in real life the constant
factors are probably pretty important, hence the median-of-medians
approach with n > 40.
Yeah, it's true that you have to be exceptionally unlucky to actually
see worst-case performance, but I imagined that it might be enough of
a difference to be worth pursuing. Experimentally, I can certainly see
the difference (I simulated it), but it is underwhelming next to
everything else.
Might be worth simplifying the swap logic, given as we only ever have
to swap SortTuple structs in qsort_arg, and given that it would make
the meta qsort_arg that much less ugly.
While I'm thinking out loud, here's another idea: Have a GUC which,
when enabled (perhaps potentially at various different granularities),
makes Timsort the in-memory sorting algorithm, as it now is by default
for Python, Java SE7 (for arrays) and the Android platform; for
certain applications, this could be a real win, and I don't think that
it would have to be invasive: it could be dropped in.

I gather from a quick read that this is supposed to be especially good
when the data is already somewhat sorted. Would there be any merit in
trying to guess when that's likely to be the case (e.g. based on the
correlation statistic)? That seems like a stretch, but OTOH a GUC
doesn't feel great either: what is the poor user to do with a query
that does 2 sorts, one of which is faster with QS and the other of
which is faster with Timsort? Ugh.
I'd imagined that it might work a bit like default_statistics_target.
Of course, I was just thinking out loud. Also, we do a very good job
on *perfectly* pre-sorted input because we check for pre-sorted input.
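That pre-sorted check amounts to a single O(n) comparison pass before
the sort proper. Schematically (an int-array sketch for illustration;
the actual check works on opaque elements through the caller-supplied
comparator):

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * One linear scan over the input, bailing out at the first
 * out-of-order pair. If the scan completes, the input is already
 * sorted and quicksort can be skipped entirely.
 * Illustrative sketch only -- not the actual qsort_arg code.
 */
static bool
array_is_presorted(const int *a, size_t n)
{
    size_t      i;

    for (i = 1; i < n; i++)
    {
        if (a[i - 1] > a[i])
            return false;
    }
    return true;
}
```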
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Fri, Jan 6, 2012 at 12:10 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
As you know, all queries tested have lots and lots of duplicates (a
~1.5GB table that contains the same number of distinct elements as a
~10MB table once did), and removing the "duplicate protection" for
btrees, on top of everything else, sees the qsort do an awful lot
better than HEAD does with the pseudo-protection.
Does that also win vs. unpatched master? If so we may as well pull
that part out and commit it separately.
I can't find any description of how this actually works... obviously,
we'd like to find a close-to-median element, but how do we actually do
that cheaply?

Uh, it works whatever way you want it to work. The implementation has
to figure out how to get the median ahead of time.
Oh. I'd be inclined to think that would be pretty inefficient, unless
we only did it for very large partitions or when we observed that some
less costly strategy was tanking.
I gather from a quick read that this is supposed to be especially good
when the data is already somewhat sorted. Would there be any merit in
trying to guess when that's likely to be the case (e.g. based on the
correlation statistic)? That seems like a stretch, but OTOH a GUC
doesn't feel great either: what is the poor user to do with a query
that does 2 sorts, one of which is faster with QS and the other of
which is faster with Timsort? Ugh.

I'd imagined that it might work a bit like default_statistics_target.
Of course, I was just thinking out loud. Also, we do a very good job
on *perfectly* pre-sorted input because we check for pre-sorted input.
We occasionally have people who want to do SELECT * FROM foo WHERE a >
100 AND a < 200 ORDER BY a, b where foo has an index on (a) but not
(a, b). This tends to suck, because we fail to realize that we can
batch the sort. We either seqscan and filter the table then sort the
results, or else we scan the index on (a) and then ignore the fact
that we have data which is already partially sorted. It's
particularly annoying if a is "mostly unique" and needs b only
occasionally to break ties. It would be nice to improve this, but it
may be more of a planner/executor problem than something we can fix
directly inside the sort implementation.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 6 January 2012 17:29, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jan 6, 2012 at 12:10 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
As you know, all queries tested have lots and lots of duplicates (a
~1.5GB table that contains the same number of distinct elements as a
~10MB table once did), and removing the "duplicate protection" for
btrees, on top of everything else, sees the qsort do an awful lot
better than HEAD does with the pseudo-protection.

Does that also win vs. unpatched master? If so we may as well pull
that part out and commit it separately.
I didn't bother isolating that, because it doesn't really make sense
to (not having it is probably only of particular value when doing what
I'm doing anyway, but who knows). Go ahead and commit something to
remove that code (get it in both comparetup_index_btree and
comparetup_index_hash), as well as the tuple1 != tuple2 test now if
you like. It's patently clear that it is unnecessary, and so doesn't
have to be justified as a performance enhancement - it's a simple case
of refactoring for clarity. As I've said, we don't do this for heap
tuples and we've heard no complaints in all that time. It was added in
commit fbac1272b89b547dbaacd78bbe8da68e5493cbda, presumably when
problems with system qsorts came to light.
I'd imagined that it might work a bit like default_statistics_target.
Of course, I was just thinking out loud. Also, we do a very good job
on *perfectly* pre-sorted input because we check for pre-sorted input.

We occasionally have people who want to do SELECT * FROM foo WHERE a >
100 AND a < 200 ORDER BY a, b where foo has an index on (a) but not
(a, b). This tends to suck, because we fail to realize that we can
batch the sort. We either seqscan and filter the table then sort the
results, or else we scan the index on (a) and then ignore the fact
that we have data which is already partially sorted. It's
particularly annoying if a is "mostly unique" and needs b only
occasionally to break ties. It would be nice to improve this, but it
may be more of a planner/executor problem than something we can fix
directly inside the sort implementation.
That sounds like an interesting problem. Something to look into for
9.3, perhaps.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
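
The batching idea Robert describes could be sketched roughly as
follows. This is purely a hypothetical standalone illustration, not
planner/executor code: given rows already ordered by a (say, from the
index on (a)), only each run of duplicate a values needs to be sorted
by b, rather than re-sorting the whole result.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical sketch: rows arrive already ordered by .first (column a);
// sort each run of equal a values by .second (column b).
std::vector<std::pair<int, int>>
batched_sort(std::vector<std::pair<int, int>> rows)
{
    auto run_start = rows.begin();
    while (run_start != rows.end())
    {
        auto run_end = run_start;
        while (run_end != rows.end() && run_end->first == run_start->first)
            ++run_end;
        // Only the run of duplicate a values needs sorting by b.
        std::sort(run_start, run_end,
                  [](const auto &l, const auto &r)
                  { return l.second < r.second; });
        run_start = run_end;
    }
    return rows;
}
```

When a is "mostly unique", most runs have length 1 and the inner sort
degenerates to a no-op, which is exactly the case where a full sort is
most annoying.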
Peter Geoghegan <peter@2ndquadrant.com> writes:
I didn't bother isolating that, because it doesn't really make sense
to (not having it is probably only of particular value when doing what
I'm doing anyway, but who knows). Go ahead and commit something to
remove that code (get it in both comparetup_index_btree and
comparetup_index_hash), as well as the tuple1 != tuple2 test now if
you like. It's patently clear that it is unnecessary, and so doesn't
have to be justified as a performance enhancement - it's a simple case
of refactoring for clarity. As I've said, we don't do this for heap
tuples and we've heard no complaints in all that time. It was added in
commit fbac1272b89b547dbaacd78bbe8da68e5493cbda, presumably when
problems with system qsorts came to light.
Actually, I'm going to object to reverting that commit, as I believe
that having equal-keyed index entries in physical table order may offer
some performance benefits at access time. If you don't like the
comment, we can change that.
regards, tom lane
On 6 January 2012 18:45, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter Geoghegan <peter@2ndquadrant.com> writes:
I didn't bother isolating that, because it doesn't really make sense
to (not having it is probably only of particular value when doing what
I'm doing anyway, but who knows). Go ahead and commit something to
remove that code (get it in both comparetup_index_btree and
comparetup_index_hash), as well as the tuple1 != tuple2 test now if
you like. It's patently clear that it is unnecessary, and so doesn't
have to be justified as a performance enhancement - it's a simple case
of refactoring for clarity. As I've said, we don't do this for heap
tuples and we've heard no complaints in all that time. It was added in
commit fbac1272b89b547dbaacd78bbe8da68e5493cbda, presumably when
problems with system qsorts came to light.

Actually, I'm going to object to reverting that commit, as I believe
that having equal-keyed index entries in physical table order may offer
some performance benefits at access time. If you don't like the
comment, we can change that.
Please explain your position. When is this supposed to be useful? Even
if you can present an argument for keeping it, it has to weigh the
fact that people don't tend to have much use for indexes with lots of
duplicates anyway. Prior to 2004, we did not do this - it was
justified as insurance against bad qsort implementations only.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Peter Geoghegan <peter@2ndquadrant.com> writes:
On 6 January 2012 18:45, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Actually, I'm going to object to reverting that commit, as I believe
that having equal-keyed index entries in physical table order may offer
some performance benefits at access time. If you don't like the
comment, we can change that.
Please explain your position. When is this supposed to be useful?
When there are lots of duplicates of a particular indexed value, the
existing code will cause an indexscan to search them in physical order,
whereas if we remove the existing logic it'll be random --- in
particular, accesses to the same heap page can no longer be expected to
be clustered.
Admittedly, I don't have any numbers quantifying just how useful that
might be, but on the other hand you've not presented any evidence
justifying removing the behavior either. If we believe your position
that indexes don't generally have lots of duplicates, then the code in
question will seldom be reached and therefore there would be no
performance benefit in removing it.
regards, tom lane
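
The tie-break Tom is defending can be sketched as a comparator over a
hypothetical IndexEntry struct (this is an illustration, not the
actual comparetup_index_btree code): equal keys fall back to physical
heap position, so an index scan over duplicates visits heap pages in
order.

```cpp
#include <cassert>

// Hypothetical index entry: a key plus the heap position it points at.
struct IndexEntry
{
    int key;
    int block;   // heap block number
    int offset;  // offset within the block
};

// Equal-keyed entries compare by physical heap position, clustering
// accesses to the same heap page during an index scan.
bool cmp_with_insurance(const IndexEntry &a, const IndexEntry &b)
{
    if (a.key != b.key)
        return a.key < b.key;
    if (a.block != b.block)
        return a.block < b.block;
    return a.offset < b.offset;
}
```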
On Fri, Jan 6, 2012 at 4:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Admittedly, I don't have any numbers quantifying just how useful that
might be, but on the other hand you've not presented any evidence
justifying removing the behavior either. If we believe your position
that indexes don't generally have lots of duplicates, then the code in
question will seldom be reached and therefore there would be no
performance benefit in removing it.
Obviously, many indexes are unique and thus won't have duplicates at
all. But if someone creates an index and doesn't make it unique, odds
are very high that it has some duplicates. Not sure how many we
typically expect to see, but more than zero...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 6 January 2012 21:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:
When there are lots of duplicates of a particular indexed value, the
existing code will cause an indexscan to search them in physical order,
whereas if we remove the existing logic it'll be random --- in
particular, accesses to the same heap page can no longer be expected to
be clustered.
Isn't it possible to get them in physical order anyway, by reading
them into memory in that order? Efficient quick sort implementations
are not stable, and ours is no exception, but we could perhaps come up
with a cheaper tie-breaker value at that stage, if you're determined
to maintain this behaviour. We have sufficient incentive to do so, as
I describe below.
Admittedly, I don't have any numbers quantifying just how useful that
might be, but on the other hand you've not presented any evidence
justifying removing the behavior either. If we believe your position
that indexes don't generally have lots of duplicates, then the code in
question will seldom be reached and therefore there would be no
performance benefit in removing it.
I ran the same btree benchmark on master, but without the "cheap
insurance". The results were interesting, to say the least.
The columns indexed were the same columns and data that I've been
using all along. Initially this made sense, as the performance of
multi sort key sorts often largely hinged on being able to get away
with doing one comparison per pair of tuples - with many duplicates, I
could avoid cheating and show something closer to worst case for the
patch.
I didn't think it mattered that indexing the same columns would
produce what happened to be a not so useful index in the real world,
due to having so many duplicates - better to have figures that were
somewhat comparable for btree tuple sorting and heap tuple sorting.
When I ran the same benchmark on a server that differed from master
only in that there was no insurance, it momentarily appeared that
*all* of the gains for btree index creation came from being able to
elide the "cheap insurance", but only where it would have to be paid
for a high number of times.
I soon realised that I'd made a blunder: the code (that is, the patch
that I posted most recently) wasn't even using my specialisation for
qsorting, because the SortSupport pointer was null! I did not have
tuplesort_begin_index_btree initialise the SortSupport struct as
tuplesort_begin_heap did, so my earlier benchmark was effectively
meaningless, except that it indicated the benefits of eliding the
cheap insurance alone, if only for that not so compelling case. You
should note that the benefits of not paying for the insurance can be
very significant indeed.
Attached are figures for an identical run of the same btree python
script, but with a version of the patch that actually uses my
specialisations. Granted, these numbers are still partially predicated
on the index in question having a large number of duplicates, but it's
worth noting:
1. The gain from specialisation isn't bad; not as good as the
improvements we saw for heap tuples, but not so bad either, especially
considering that binary bloat should be much less controversial for
btree tuples.
2. The index that results from the tests is still useful; the planner
is perfectly willing to use it rather than performing an
in-memory sort. It will also use it to satisfy a query like "select *
from orderlines where prod_id = 5", albeit via a bitmap index scan. I
took the precaution of increasing default_statistics_target to its
maximum value, as well as performing an analyze on orderlines in
advance of checking this.
Revision to this patch that fixes the bug to follow - I produced these
new numbers from a rough cut of that.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Obviously, many indexes are unique and thus won't have duplicates at
all. But if someone creates an index and doesn't make it unique, odds
are very high that it has some duplicates. Not sure how many we
typically expect to see, but more than zero...
Peter may not, but I personally admin lots of databases which have
indexes on values like "category" or "city" which have 100's or 1000's
of duplicates per value. I don't think this is uncommon at all.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 9 January 2012 19:45, Josh Berkus <josh@agliodbs.com> wrote:
Obviously, many indexes are unique and thus won't have duplicates at
all. But if someone creates an index and doesn't make it unique, odds
are very high that it has some duplicates. Not sure how many we
typically expect to see, but more than zero...

Peter may not, but I personally admin lots of databases which have
indexes on values like "category" or "city" which have 100's or 1000's
of duplicates per value. I don't think this is uncommon at all.
Uh, then all the more reason to do what I recommend, I imagine. There
is most definitely a large overhead to creating such indexes, at least
for scalar types. As far as I can tell, Tom's complaint is quite
speculative.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On 6 January 2012 21:47, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jan 6, 2012 at 4:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Admittedly, I don't have any numbers quantifying just how useful that
might be, but on the other hand you've not presented any evidence
justifying removing the behavior either. If we believe your position
that indexes don't generally have lots of duplicates, then the code in
question will seldom be reached and therefore there would be no
performance benefit in removing it.
I have decided on a tactical retreat in relation to this patch. This
has been dragging on since September. I cannot risk not having the
patch accepted for 9.2, due to trying to handle both heap and btree
tuple sorting at once - the additional relatively modest improvements
for common cases that it will bring to btree index creation time do
not warrant digging my heels in to cover that case in one larger
commit. For that reason, I attach for your consideration a revision of
the patch without any support for btree index tuples whatsoever
(though I have adjusted btree tuplesort comments, and one tiny piece
of code, in a non-controversial way). I'll revisit btree sorting
towards the end of the commitfest, circumstances permitting.
As to the question of binary bloat, I have devised a pgbench-tools
workload that I think will go some way towards reassuring those who
are concerned about its distributed costs. The attached spreadsheet
has all the relevant details.
A custom script has been specified (details of which are evident from
benchmark results themselves). If we were to experience some kind of
noticeable marginalisation of usefully cached instructions, that's
obviously where it would show up.
Note that I have taken the preparatory step of updating some tables
with random data, rather than the sequential data that they have by
default, which, unusually for such large tables, can allow us to avail
of our pre-sorted input check optimisation, spoiling things. I did
vacuumdb immediately afterwards, immediately before the pgbench run in
each case.
I've kept the test at 8 clients/8 threads, on a machine with 4
physical cores + hyperthreading for 8 logical ("Intel Core(TM) i7 CPU
870 @ 2.93GHz", with 4 x 32 KB instruction caches, among several
other CPU caches) on a newly-initialised, throw-away database, single
run at scale 50 for 15 minutes in all cases. This is the same server
that I have used throughout.
My analysis is that although there may be a very slight regression in
non-affected queries (it's a tough call to make - how queries happen
to coincide undoubtedly muddies the waters, as does the fact that we
cut lots of time from periods in which backends hold a lot of locally
allocated sort memory - it might actually be a win for those other
queries sometimes but not others, and it appears that the margin
either way is very small), that is more than compensated for by the
benefit of the specialisations. The CPU cache is doing its job as
expected, and we clearly see a net benefit from its preference for
caching more instructions that are specialised - those sort
operations are way more expensive (if they weren't, it wouldn't cache
them so heavily). I also include results for running the same query
again and again with a single client, to put that effect in context
(though note that previous benchmarks avoided paying a high client
overhead by using explain analyze, and indeed originally compared
pre-SortSupport Postgres, so these numbers aren't as good either).
It's unusual for a database workload to be so heavily CPU bound, so
I'd suggest that the latter benchmark is more representative than the
former. Either way, we win by some margin. If the queries didn't have
such a high client overhead, as for example with a sort node that
feeds a merge join, we'd do better still.
If a sorting specialisation is never used anyway, the overhead, for
practical purposes, is zero. I believe that sorting specialisations
are just too useful to not be a net win in almost all reasonable
cases. Why haven't I used all specialisations at once, rather than
only two (one for single int4, the other multiple)? Well, I might have
used all of them, but there were no floats available in the pgbench
tables, and besides, the chances of all of them being simultaneously
in play during any sufficiently short period for marginalisation of
CPU cache contents to be of particular concern are, in general, not all
that great.
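
For illustration only (this is not the patch's code, and the function
names here are invented), the kind of specialisation at issue
contrasts a generic qsort(), which pays an indirect function call per
comparison, with a sort whose comparator the compiler can inline into
the inner loop:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// Comparator suitable for C qsort(): called through a function pointer
// for every pair of elements.
static int int4_cmp(const void *a, const void *b)
{
    int av = *(const int *) a, bv = *(const int *) b;
    return (av > bv) - (av < bv);
}

std::vector<int> sort_generic(std::vector<int> v)
{
    // Indirect call per comparison; the comparator cannot be inlined.
    qsort(v.data(), v.size(), sizeof(int), int4_cmp);
    return v;
}

std::vector<int> sort_specialised(std::vector<int> v)
{
    // The int comparison is inlined into the sort's inner loop.
    std::sort(v.begin(), v.end());
    return v;
}
```

Both produce the same ordering; the specialised version simply spends
less time per comparison, which is where the raw qsort_arg gains come
from.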
A major systems programming language, C++, produces multiple
specialisations as its standard library's sort function is used, for
example, each of which will have separate entries in the procedure
linkage table (though various implementation-defined techniques are
frequently used to reduce the number of copies across translation
units at link time). If the programmer specifies either a different
datatype, or a different comparator (through the use of an alternative
to the default std::less<T> functor/predicate), a whole new
specialisation is generated by the compiler. This does raise concerns
in relation to binary bloat, but they're reasonably well understood,
and this is the default way of doing things, for general purpose
application development. Now, I know that Postgres isn't written in
C++, and you might not find "C++ does it" to be a particularly
compelling argument, but it does go to show that these ideas are not
by any stretch of the imagination radical.
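
A minimal illustration of that point: the two calls below instantiate
two independent specialisations of std::sort at compile time, one per
element type/comparator pair, each with its own copy of the sort code
in the binary.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

std::vector<int> sort_ints(std::vector<int> v)
{
    // Specialisation #1: std::sort over int with the default comparator.
    std::sort(v.begin(), v.end());
    return v;
}

std::vector<double> sort_doubles_desc(std::vector<double> v)
{
    // Specialisation #2: different element type AND different
    // comparator; the compiler emits a second, separate sort.
    std::sort(v.begin(), v.end(), std::greater<double>());
    return v;
}
```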
I remind everyone that the improvement seen in raw qsort_arg runtime
is fairly large, as it would have to be, in order to bring such large
though proportionally smaller improvements to each given affected
query as a whole - there are of course many other sources of overhead
involved in parsing, planning and executing affected queries.
Here is a complete detailing of the specialisations that this latest
revision produces:
int4, single key (supports date too)
int8, multiple keys (supports timestamps too, with and without TZ,
where we HAVE_INT64_TIMESTAMP)
float4, single key
float8, multiple keys
Type-generic single key specialisation. Expected to deliver some of
the same additional benefits for types that do not merit their own
specialisations but would still benefit, like name, int2, etc, but is
used all the time where applicable, even for types like text for which
it is expected that there will be no appreciable benefit.
Thoughts?
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
I decided to take advantage of my ongoing access to a benchmarking
server to see how I could get on with a query with an especially large
sort. Now, that box has 16GB of ram, and some people have that much
memory in their laptops these days, so I was somewhat limited in how
far I could push things.
I eventually decided upon much the same benchmark as I'd made in my
previous e-mail (the query "SELECT * FROM pgbench_accounts ORDER BY
bid;", but primed with random data, while being sure to subsequently
vacuum).
I stress that I have not made any attempt to artificially remove
client overhead. I have, however, used a faster disk (caches were not
warmed in a separate run of pgbench or anything like that for either
of my two most recent benchmarks, so there may have been and indeed
may still be some small bias there), in addition to connecting pgbench
with the -h option. Apparently a known problem with Linux kernels
using the Completely Fair Scheduler introduced in 2.6.23 is that it
does not schedule the pgbench program very well when it's connecting
to the database using Unix-domain sockets, as I have been until now.
I'm not sure that this had all that much potential to spoil my
results, but didn't want to take the chance.
Anyway, the results of running this latest benchmark, with 20
successive runs of each large query, were:
Patch:
pghost: localhost pgport: nclients: 1 nxacts: 20 dbName: pgbench
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
number of transactions per client: 20
number of transactions actually processed: 20/20
tps = 0.030924 (including connections establishing)
tps = 0.030924 (excluding connections establishing)
statement latencies in milliseconds:
31163.957400 SELECT * FROM pgbench_accounts ORDER BY bid;
Master:
pghost: localhost pgport: nclients: 1 nxacts: 20 dbName: pgbench
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
number of transactions per client: 20
number of transactions actually processed: 20/20
tps = 0.026769 (including connections establishing)
tps = 0.026769 (excluding connections establishing)
statement latencies in milliseconds:
36179.508750 SELECT * FROM pgbench_accounts ORDER BY bid;
That's a larger proportional improvement than reported on Thursday,
and at significantly greater scale.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services