Eliminating CREATE INDEX comparator TID tie-breaker overhead
I have one more idea for accelerating sorting-related operations that
is both fairly effective and relatively easy to implement. This just
about clears my backlog of those, though.
There is an open item on the TODO list entitled "Consider whether
duplicate keys should be sorted by block/offset" [1].
Currently, comparetup_index_btree() and comparetup_index_hash() do a
tie-breaker on ItemPointer (heap TID). This started as insurance
against bad qsort() implementations that do badly with many equal
keys, but it was also thought that there is some value in having equal
keys be in physical (heap) order. Clearly the first concern has been
obsolete since 2006, when a high quality qsort() was added to
Postgres, but the second concern is probably still valid.
Overhead
-------------
Tom said in 2008 that the CREATE INDEX overhead of doing this may be
about 7% [2]. It seems even worse now, presumably due to SortSupport
reducing costs elsewhere. Several years ago, I suggested ripping out
the tie-breaker too, but didn't get very far with that idea due to the
potential downsides for index scans. I'm not about to revisit the
discussion from 2011 about whether or not it should be torn out. I'd
rather just (mostly) fix the real problem without changing the
behavior of comparetup_index_btree() (or comparetup_index_hash()).
The real problem is that we do pointer chasing to get to the
IndexTuple ("tuple proper"), where we might get away with only
examining the SortTuple, as would happen in comparetup_heap() in the
event of an equivalent heap tuplesort, for example. The added memory
latency hurts most with lower-cardinality attributes (when only one
attribute appears in the index definition and the IndexTuple isn't
cache resident), and it wastes memory bandwidth on the system, which is
something there is virtually never enough of.
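To make the memory-access point concrete, here is an abridged sketch,
modeled loosely on tuplesort.c's SortTuple and the existing B-Tree
comparator, but simplified (the struct below is a stand-in, and
tid_tiebreak is an invented name used only for illustration). Comparing
datum1 reads only the memtuples array element that qsort() is already
streaming through; the TID tie-breaker has to follow the tuple pointer
out to the IndexTuple.

#include "postgres.h"
#include "access/itup.h"

/* Abridged stand-in for tuplesort.c's SortTuple; not the real definition */
typedef struct
{
    void       *tuple;          /* pointer to the IndexTuple "tuple proper" */
    Datum       datum1;         /* value of first key column, kept inline */
    bool        isnull1;        /* is first key column NULL? */
    int         tupindex;       /* run/tape number; unused by in-memory sorts */
} SortTuple;

/*
 * The existing tie-breaker for equal keys: it must dereference a->tuple and
 * b->tuple just to read each heap TID, which tends to cost a cache miss per
 * comparison when the IndexTuples are not cache resident.
 */
static int
tid_tiebreak(const SortTuple *a, const SortTuple *b)
{
    IndexTuple  tuple1 = (IndexTuple) a->tuple;     /* pointer chase */
    IndexTuple  tuple2 = (IndexTuple) b->tuple;     /* pointer chase */
    BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
    BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);

    if (blk1 != blk2)
        return (blk1 < blk2) ? -1 : 1;

    {
        OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
        OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);

        if (pos1 != pos2)
            return (pos1 < pos2) ? -1 : 1;
    }
    return 0;
}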
Patch
=====
Attached patch adds a new tie-breaker which is used only when the
tuples being sorted still fit in memory. Currently, the field
SortTuple.tupindex has a different purpose depending on whether we're
building initial runs or merging runs. However, SortTuple.tupindex
currently has no purpose when a sort can complete entirely in memory,
so that case is available to support a third meaning:
SortTuple.tupindex can be used to hold a simple ordinal number. When
our status is TSS_INITIAL (when we qsort()), this is used as a proxy
for what the TID tie-breaker would ordinarily return.
This reliably preserves the existing behavior because index tuplesort
clients invariably scan heap tuples in physical order, and so nothing
is lost.
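To see the trick in isolation, here is a toy, standalone C program (not
PostgreSQL code; Item and cmp_with_ordinal are invented for the
example). Because every element carries a distinct insertion ordinal,
an unstable qsort() still leaves equal keys in their input order, which,
for an index build fed tuples in heap physical order, is exactly what
the TID tie-breaker would have produced.

#include <stdio.h>
#include <stdlib.h>

/* Toy stand-in for a SortTuple: a key plus the ordinal of insertion */
typedef struct
{
    int         key;            /* sort key (think: first index attribute) */
    int         ordinal;        /* position in input order (think: tupindex) */
} Item;

/* Compare on key; on ties, fall back to insertion order */
static int
cmp_with_ordinal(const void *pa, const void *pb)
{
    const Item *a = (const Item *) pa;
    const Item *b = (const Item *) pb;

    if (a->key != b->key)
        return (a->key < b->key) ? -1 : 1;
    /* Equal keys: preserve original (heap physical) order */
    return (a->ordinal < b->ordinal) ? -1 : 1;
}

int
main(void)
{
    Item        items[] = {{5, 0}, {1, 1}, {5, 2}, {1, 3}, {5, 4}};
    int         n = sizeof(items) / sizeof(items[0]);
    int         i;

    qsort(items, n, sizeof(Item), cmp_with_ordinal);

    /* Equal keys come out in their original relative order */
    for (i = 0; i < n; i++)
        printf("key=%d ordinal=%d\n", items[i].key, items[i].ordinal);
    return 0;
}

The patch applies the same idea by reusing SortTuple.tupindex as the
ordinal when the sort stays entirely in memory; external sorts keep the
existing ItemPointer tie-breaker because tupindex is already spoken for
there.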
Performance
-----------------
The patch can make CREATE INDEX with an unlogged B-Tree index with a
single int4 attribute as much as 15% faster (although I think under
10% is much more likely). That seems like an improvement that makes
the patch worthwhile.
Design considerations and consequences
--------------------------------------------------------
I don't think that there is much point in getting rid of
SortTuple.tupindex for in-memory sorts, a possibility that this patch
closes off. I tested the performance of a SortTuple struct with only a
datum1 and "tuple proper" for in-memory sorting at one time; it was
disappointing, and in any case the change was unworkably invasive.
I also don't think it's worth worrying about the fact that tupindex is
assigned an ordinal number even in cases where it won't be used inside
the comparator. The overhead has to be indistinguishable from zero
anyway, and the right place to put that assignment happens to be where
there is generic TSS_INITIAL copying within puttuple_common().
I'm not concerned about synchronized scans breaking my assumption of a
physical ordering of heaptuples being fed to tuplesort.c. I think that
it is unlikely to ever be worth seriously considering this case.
I have a hard time imagining anything (beyond synchronous scans)
breaking my assumption that index tuplesorts receive tuples in heap
physical order. If anything was to break that in the future (e.g.
parallelizing the heap scan for index builds), then IMV the onus
should be on that new case to take appropriate precautions against
breaking my assumption.
[1]: https://wiki.postgresql.org/wiki/Todo#Sorting
[2]: /messages/by-id/23321.1205726381@sss.pgh.pa.us
--
Peter Geoghegan
Attachment: 0001-Cache-aware-index-sort-comparator-tie-breakers.patch (text/x-patch)
From 22e3aa14436d3c4ca8ccde2cc55777c6207fd9f0 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghegan86@gmail.com>
Date: Thu, 16 Jul 2015 22:26:01 -0700
Subject: [PATCH] Cache-aware index sort comparator tie-breakers
For the B-Tree and hash index build sort cases only, tuplesort
previously performed a tie-breaker on ItemPointer (heap physical order).
This may later provide some performance benefits for index scans.
However, it implied significant overhead during index builds, because
additional pointer chasing (to the IndexTuple "tuple proper") was
required in the comparators, which adds significant memory latency where
the "tuple proper" does not otherwise need to be accessed (e.g. single
pass-by-value attribute B-Tree sorts).
To accomplish the same outcome inexpensively, ensure stable sort output
in the traditional way for quicksort: tie-break by comparing a simple
ordinal number (reuse the field SortTuple.tupindex for this, since when
comparing it's already bound to be in L1 cache). A "tuple proper"
tie-breaker is therefore not required. However, since
SortTuple.tupindex is already used by external sorts, it cannot be
repurposed there, so external sorts still tie-break the expensive way.
This formalizes a requirement for index build tuplesort clients that
they've always met: that tuples are supplied to tuplesort in the
original physical/heap order. It seems acceptable that this is somewhat
undermined by synchronized heap scans. Someday, we might be able to
remove all tie-breakers for external sorts, if tape sort can be made to
be a stable sort (as many merge sort implementations are).
In passing, move a related assertion to a slightly different, more
appropriate place.
---
src/backend/utils/sort/tuplesort.c | 73 +++++++++++++++++++++++++++-----------
1 file changed, 53 insertions(+), 20 deletions(-)
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 435041a..ab69320 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -160,11 +160,15 @@ bool optimize_bounded_sort = true;
* described above. Accordingly, "tuple" is always used in preference to
* datum1 as the authoritative value for pass-by-reference cases.
*
- * While building initial runs, tupindex holds the tuple's run number. During
- * merge passes, we re-use it to hold the input tape number that each tuple in
- * the heap was read from, or to hold the index of the next tuple pre-read
- * from the same tape in the case of pre-read entries. tupindex goes unused
- * if the sort occurs entirely in memory.
+ * tupindex is initially used as an ordinal identifier; in other words, it
+ * tracks the original physical position of the tuple. This is useful for
+ * index sorts that are ultimately performed entirely in memory, since equal
+ * IndexTuples can remain in physical order by just comparing tupindex. If
+ * the sort does not fit in memory, then, while building initial runs,
+ * tupindex holds the tuple's run number. During merge passes, we re-use it
+ * to hold the input tape number that each tuple in the heap was read from, or
+ * to hold the index of the next tuple pre-read from the same tape in the case
+ * of pre-read entries.
*/
typedef struct
{
@@ -1237,6 +1241,14 @@ tuplesort_putheaptuple(Tuplesortstate *state, HeapTuple tup)
/*
* Collect one index tuple while collecting input data for sort, building
* it from caller-supplied values.
+ *
+ * Note that there is an assumption that callers provide values from heap
+ * tuples in physical order. This makes maintaining that order among equal
+ * tuples inexpensive. Synchronized heap scans performed by callers might
+ * undermine the assumption. This isn't worth guarding against, since the
+ * assumption exists only for performance reasons, and the resulting index
+ * will have almost physical ordering among equal tuples, which is almost
+ * as good.
*/
void
tuplesort_putindextuplevalues(Tuplesortstate *state, Relation rel,
@@ -1405,7 +1417,9 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple)
(void) grow_memtuples(state);
Assert(state->memtupcount < state->memtupsize);
}
- state->memtuples[state->memtupcount++] = *tuple;
+ /* When we still fit in memory, tupindex is ordinal tuple # */
+ tuple->tupindex = state->memtupcount++;
+ state->memtuples[tuple->tupindex] = *tuple;
/*
* Check if it's time to switch over to a bounded heapsort. We do
@@ -3553,6 +3567,13 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
}
/*
+ * Some rather brain-dead implementations of qsort (such as the one in QNX
+ * 4) will sometimes call the comparison routine to compare a value to
+ * itself, but we always use our own implementation, which does not.
+ */
+ Assert(tuple1 != tuple2);
+
+ /*
* If btree has asked us to enforce uniqueness, complain if two equal
* tuples are detected (unless there was at least one NULL field).
*
@@ -3567,14 +3588,6 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
bool isnull[INDEX_MAX_KEYS];
char *key_desc;
- /*
- * Some rather brain-dead implementations of qsort (such as the one in
- * QNX 4) will sometimes call the comparison routine to compare a
- * value to itself, but we always use our own implementation, which
- * does not.
- */
- Assert(tuple1 != tuple2);
-
index_deform_tuple(tuple1, tupDes, values, isnull);
key_desc = BuildIndexValueDescription(state->indexRel, values, isnull);
@@ -3590,10 +3603,20 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
}
/*
- * If key values are equal, we sort on ItemPointer. This does not affect
- * validity of the finished index, but it may be useful to have index
- * scans in physical order.
+ * If key values are equal and we are doing an internal sort, we sort on
+ * tupindex, which is a simple ordinal tuple number (when status is
+ * TSS_INITIAL). This does not affect validity of the finished index, but
+ * it may be useful to have index scans in physical order.
+ *
+ * For external sorts, we sort on ItemPointer. This is necessary because
+ * even tape sort is not a stable sort, and external sorts use tupindex
+ * for another purpose. It's worth avoiding pointer chasing where
+ * possible, since sorting is usually bottlenecked on memory latency, and
+ * tuple1 and tuple2 may not be cache resident.
*/
+ if (state->status == TSS_INITIAL)
+ return (a->tupindex < b->tupindex) ? -1 : 1;
+
{
BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
@@ -3636,10 +3659,20 @@ comparetup_index_hash(const SortTuple *a, const SortTuple *b,
return -1;
/*
- * If hash values are equal, we sort on ItemPointer. This does not affect
- * validity of the finished index, but it may be useful to have index
- * scans in physical order.
+ * If hash values are equal and we are doing an internal sort, we sort on
+ * tupindex, which is a simple ordinal tuple number (when status is
+ * TSS_INITIAL). This does not affect validity of the finished index, but
+ * it may be useful to have index scans in physical order.
+ *
+ * For external sorts, we sort on ItemPointer. This is necessary because
+ * even tape sort is not a stable sort, and external sorts use tupindex
+ * for another purpose. It's worth avoiding pointer chasing where
+ * possible, since sorting is usually bottlenecked on memory latency, and
+ * tuple1 and tuple2 are unlikely to be cache resident.
*/
+ if (state->status == TSS_INITIAL)
+ return (a->tupindex < b->tupindex) ? -1 : 1;
+
tuple1 = (IndexTuple) a->tuple;
tuple2 = (IndexTuple) b->tuple;
--
1.9.1
On Tue, Jul 21, 2015 at 4:06 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Design considerations and consequences
> --------------------------------------------------------
Good write-up.
> I'm not concerned about synchronized scans breaking my assumption of a
> physical ordering of heaptuples being fed to tuplesort.c. I think that
> it is unlikely to ever be worth seriously considering this case.
Why not?
> I have a hard time imagining anything (beyond synchronous scans)
> breaking my assumption that index tuplesorts receive tuples in heap
> physical order. If anything was to break that in the future (e.g.
> parallelizing the heap scan for index builds), then IMV the onus
> should be on that new case to take appropriate precautions against
> breaking my assumption.
I'm very dubious about that. There are lots of reasons why we might
want to read tuples out of order; for example, suppose we want a
parallel sequential scan to feed the sort.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jul 22, 2015 at 11:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm not concerned about synchronized scans breaking my assumption of a
>> physical ordering of heaptuples being fed to tuplesort.c. I think that
>> it is unlikely to ever be worth seriously considering this case.
>
> Why not?
The case for doing this tie-breaker is theoretical. The original
rationale for adding it is now obsolete. On the other hand, the cost
of doing it is most certainly not theoretical.
If it's absolutely necessary to preserve a guarantee that equal tuples
are in TID order (rather than at most two staggered sequential
groupings per set of equal tuples -- possible with synchronized
scans), then I suggest we detect synchronized scans and disable this
optimization. They're not especially common, so it would still be
worthwhile, in my estimation.
>> I have a hard time imagining anything (beyond synchronous scans)
>> breaking my assumption that index tuplesorts receive tuples in heap
>> physical order. If anything was to break that in the future (e.g.
>> parallelizing the heap scan for index builds), then IMV the onus
>> should be on that new case to take appropriate precautions against
>> breaking my assumption.
>
> I'm very dubious about that. There are lots of reasons why we might
> want to read tuples out of order; for example, suppose we want a
> parallel sequential scan to feed the sort.
I agree that there are many reasons to want to do that. If we ever get
parallel sorts, then having a bunch of memtuple arrays, each fed by a
worker participating in a parallel scan makes sense. Those runs could
still be sorted in physical order. Only the merge phase would have to
do pointer chasing to tie-break on the real TID, and maybe not even
then (because run number can also be a proxy for physical order when
merging, assuming some new parallel internal sort algorithm).
Actually, for tape sort/replacement selection sort, the initial heap
build (where run number 0 is assigned currently) could probably reuse
this trick. I just haven't done that because it would be significantly
more invasive and less helpful.
If it's just a matter of wanting to parallelize the heap scan for its
own sake, then I don't think that's likely to be a terribly effective
optimization for B-Tree index builds; most of the cost is always going
to be in the sort proper, which I'm targeting here. In any case, I
think that the very common case where an int4 PK index is built using
quicksort should not have to suffer to avoid a minor inconveniencing
of (say) parallel sorts.
--
Peter Geoghegan
On Wed, Jul 22, 2015 at 3:17 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>> I have a hard time imagining anything (beyond synchronous scans)
>>> breaking my assumption that index tuplesorts receive tuples in heap
>>> physical order. If anything was to break that in the future (e.g.
>>> parallelizing the heap scan for index builds), then IMV the onus
>>> should be on that new case to take appropriate precautions against
>>> breaking my assumption.
>>
>> I'm very dubious about that. There are lots of reasons why we might
>> want to read tuples out of order; for example, suppose we want a
>> parallel sequential scan to feed the sort.
>
> I agree that there are many reasons to want to do that. If we ever get
> parallel sorts, then having a bunch of memtuple arrays, each fed by a
> worker participating in a parallel scan makes sense. Those runs could
> still be sorted in physical order. Only the merge phase would have to
> do pointer chasing to tie-break on the real TID, and maybe not even
> then (because run number can also be a proxy for physical order when
> merging, assuming some new parallel internal sort algorithm).
>
> Actually, for tape sort/replacement selection sort, the initial heap
> build (where run number 0 is assigned currently) could probably reuse
> this trick. I just haven't done that because it would be significantly
> more invasive and less helpful.
>
> If it's just a matter of wanting to parallelize the heap scan for its
> own sake, then I don't think that's likely to be a terribly effective
> optimization for B-Tree index builds; most of the cost is always going
> to be in the sort proper, which I'm targeting here. In any case, I
> think that the very common case where an int4 PK index is built using
> quicksort should not have to suffer to avoid a minor inconveniencing
> of (say) parallel sorts.
My priorities are different from yours. Your conclusion is basically
that it's OK to burden everyone who comes along and does future
development that may use the sorting code differently from the way
it's used now with dealing with this issue somehow, or deciding not to
deal with it. I have a really tough time agreeing with that;
tuplesort.c is, and should be, an abstraction layer, and people using
it from the outside should not need to worry about what happens on the
inside.
Your original post lays out two rationales for the TID comparisons,
and says that one of them is obsolete, but the other is "probably"
still valid. I think what you should do is go find out whether the
second rationale is valid or not. If it's not, we can get rid of that
code. If it is valid, then we can't. I'm not going to endorse the
notion that tuplesort.c will only DTRT if it receives tuples in TID
order; it cannot be the responsibility of the caller of the sort code
to ensure that the tuples are sorted. Even if it shaves a few
percentage points off the runtime now, the complexity it imposes on
future patch authors is, IMO, not worth it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 23, 2015 at 8:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> My priorities are different from yours. Your conclusion is basically
> that it's OK to burden everyone who comes along and does future
> development that may use the sorting code differently from the way
> it's used now with dealing with this issue somehow, or deciding not to
> deal with it. I have a really tough time agreeing with that;
> tuplesort.c is, and should be, an abstraction layer, and people using
> it from the outside should not need to worry about what happens on the
> inside.
I don't know where this idea came from, because I'm already supporting
the requirement for the external sorting case, where doing this is too
much of a burden. That is entirely unchanged, and so such differences
are clearly already respected. Any new patch likely to care about this
(e.g. parallel internal sorts) will probably not be affected by it
anyway, simply because it will add a new tuplesort state, or because
multiple "memtuples" arrays will be filled in parallel, but each still
in sequential order, so everything works out just the same.
Aside from that, the author might have to spend 5 minutes thinking
about it. I don't see the problem.
Adding this single new requirement for exactly 2 extant callers does
not make tuplesort any less well encapsulated. Please don't write
wildly inaccurate summaries of what I've said, like "Your conclusion
is basically that it's OK to burden everyone who comes along and does
future development that may use the sorting code differently from the
way it's used now". That is patently untrue.
> Your original post lays out two rationales for the TID comparisons,
> and says that one of them is obsolete, but the other is "probably"
> still valid. I think what you should do is go find out whether the
> second rationale is valid or not. If it's not, we can get rid of that
> code. If it is valid, then we can't. I'm not going to endorse the
> notion that tuplesort.c will only DTRT if it receives tuples in TID
> order; it cannot be the responsibility of the caller of the sort code
> to ensure that the tuples are sorted. Even if it shaves a few
> percentage points off the runtime now, the complexity it imposes on
> future patch authors is, IMO, not worth it.
More than a few - sometimes more than 10%.
The second rationale was, as far as I can tell, a theoretical one that
was never experimentally validated. I'm pretty sure you could come up
with a case where not having it hurt, if you were sufficiently
creative. I'm not sure that I have the stomach for another protracted
debate about these fuzzy costs, which this patch was supposed to avoid.
However, you don't like this patch for reasons that I cannot fathom. I
think that I will have to withdraw it, and forget about cutting this
unnecessary cost from B-Tree builds.
Our priorities are different, but mine are changing; I simply don't
want to spend a lot of time arguing with you about things like this.
--
Peter Geoghegan
On Thu, Jul 23, 2015 at 2:19 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Our priorities are different, but mine are changing; I simply don't
> want to spend a lot of time arguing with you about things like this.
Good.
If other people feel strongly about this issue, then they can weigh in
and we'll see where we end up. If they don't, then there's no
consensus to proceed with this, and we shouldn't *have* to spend a lot
of time on it.
Also, I resent the implication that I wrote a deliberately inaccurate
summary of your position. If it was inaccurate, it wasn't deliberate.
More likely, we simply view the situation differently. Please assume
good faith.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 23, 2015 at 11:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> If other people feel strongly about this issue, then they can weigh in
> and we'll see where we end up. If they don't, then there's no
> consensus to proceed with this, and we shouldn't *have* to spend a lot
> of time on it.
If no one weighs in after a few days, I'll mark the patch "rejected"
in the CF app.
> Also, I resent the implication that I wrote a deliberately inaccurate
> summary of your position. If it was inaccurate, it wasn't deliberate.
> More likely, we simply view the situation differently. Please assume
> good faith.
You wrote "Your conclusion is basically that it's OK to burden
everyone who comes along and does future development that may use the
sorting code differently from the way it's used now". If you'd like me
to assume good faith in these situations, maybe you should be more
careful about your choice of words. Your statement was extremely broad,
unlike the very narrow technical issue under discussion.
--
Peter Geoghegan
2015-07-23 Robert Haas <robertmhaas@gmail.com>:
> I think what you should do is go find out whether the second rationale
> is valid or not.
Knowing how much impact such “non TID ordered” entries have on
performance would of course be very useful for future patch authors.
Especially useful would be to know whether interleaving a small number
of TID ordered streams (as would probably be generated by parallel
scans/processing) would result in an ordering that performs
significantly worse or not. I assume (but cannot prove) that in this
case the OS will understand the read pattern as being multiple streams
and prefetching will work correctly.
> I'm not going to endorse the notion that tuplesort.c will only DTRT if
> it receives tuples in TID order; it cannot be the responsibility of
> the caller of the sort code to ensure that the tuples are sorted.
Except that it will do the right thing (as in correctness); it just
might not result in the best overall performance possible (for future
queries). I think it is a typical property of “reasons for performance
to be good” that they rely on a lot of otherwise independent code
working together in the right way.
Nicolas
--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?
2015-07-24 Nicolas Barbier <nicolas.barbier@gmail.com>:
> Especially useful would be to know whether interleaving a small number
> of TID ordered streams (as would probably be generated by parallel
> scans/processing) would result in an ordering that performs
> significantly worse or not. I assume (but cannot prove) that in this
> case the OS will understand the read pattern as being multiple streams
> and prefetching will work correctly.
OTOH, that is probably only true when there are a large number of
duplicate keys. Otherwise the order within each (small) group will
appear random, which may or may not result in a significant
performance drop. This probably also depends on whether fadvise (or
friends) are used.
Nicolas
--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?
On Thu, Jul 23, 2015 at 11:44 AM, Peter Geoghegan <pg@heroku.com> wrote:
> If no one weighs in after a few days, I'll mark the patch "rejected"
> in the CF app.
The patch has been marked "rejected" in the CF app. I withdraw it.
Obviously I still think that the patch is worthwhile, but not if that
"while" is disproportionate to any benefit, which I suspect will be
the case if I proceed.
--
Peter Geoghegan