More work on SortSupport for text - strcoll() and strxfrm() caching
Since apparently we're back to development work, I thought it was time
to share a patch implementing a few additional simple tricks to make
sorting text under a non-C locale even faster than in 9.5. These
techniques are mostly effective when values are physically clustered
together. This might be because there is a physical/logical
correlation, but cases involving any kind of clustering of values are
helped significantly.
Caching
======
The basic idea is that we cache strxfrm() blobs. Separately, we
exploit temporal locality and clustering of values by caching the
result of the most recent strcoll()-resolved comparison performed. The
strxfrm() technique helps a lot with low cardinality single attribute
sorts if we can avoid most strxfrm() work. On the other hand,
strcoll() comparison caching particularly helps with multi-attribute
sorts where there is a low to moderate cardinality secondary attribute
and low cardinality leading attribute. The master branch will still
opportunistically take the equality memcmp() fastpath plenty of times
for that second attribute, but there are no abbreviated keys to help
when that doesn't work out (because it isn't the leading attribute).
Regressions
==========
The patch still helps with strcoll() avoidance when the ordering of a
moderate cardinality attribute is totally random, but it helps much
less there. I have not seen a regression for any case. I'm expecting
someone to ask me to do something with the program I wrote last year,
to prove the opportunistic memcmp() equality fastpath for text is
"free" [1]. This patch has exactly the same tension as last year's
memcmp() equality one [2]: I add something opportunistic, that in
general might consistently not work out at all in some cases, and on
the face of it implies extra costs for those cases -- costs which must
be paid every single time. So as with the opportunistic memcmp()
equality thing, the *actual* overhead for cases that do not benefit
must be virtually zero for the patch to be worthwhile. That is the
standard that I expect that this patch will be held to, too.
Benchmark
=========
The query that I've been trying this out with is a typical rollup
query, using my "cities" sample data [3] (this is somewhat, although
not perfectly correlated on (country, province) before sorting):
postgres=# select country, province, count(*) from cities group by
rollup (country, province);
               country               |              province               | count
-------------------------------------+-------------------------------------+--------
 Afghanistan                         | Badaẖšan                            |      5
 Afghanistan                         | Bādgīs                              |      2
 Afghanistan                         | Baġlān                              |      5
 Afghanistan                         | Balẖ                                |      6
 Afghanistan                         | Bāmiyān                             |      3
 Afghanistan                         | Farāh                               |      3
 Afghanistan                         | Fāryāb                              |      4
 Afghanistan                         | Ġawr                                |      3
 *** SNIP ***
 Zimbabwe                            | Manicaland                          |     22
 Zimbabwe                            | Mashonaland Central                 |     13
 Zimbabwe                            | Mashonaland East                    |      9
 Zimbabwe                            | Mashonaland West                    |     21
 Zimbabwe                            | Masvingo                            |     11
 Zimbabwe                            | Matabeleland North                  |      8
 Zimbabwe                            | Matabeleland South                  |     14
 Zimbabwe                            | Midlands                            |     14
 Zimbabwe                            | [null]                              |    116
 [null]                              | [null]                              | 317102
(3529 rows)
With master, this takes about 525ms when it stabilizes after a few
runs on my laptop. With the patch, it takes about 405ms. That's almost
a 25% reduction in total run time. If I perform a more direct test of
sort performance against this data with minimal non-sorting overhead,
I see a reduction of as much as 30% in total query runtime (I chose
this rollup query because it is obviously representative of the real
world).
If this data is *perfectly* correlated (e.g. because someone ran
CLUSTER) and some sort can use the dubious "bubble sort best case"
path [4] that we added to qsort back in 2006, the improvement still
holds up at ~20%, I've found.
Performance of the "C" locale
---------------------------------------
For this particular rollup query, my 25% improvement leaves the
collated text sort perhaps marginally faster than an equivalent query
that uses the "C" locale (with or without the patch applied). It's
hard to be sure that that effect is real -- many trials are needed --
but it's reasonable to speculate that it's possible to sometimes beat
the "C" locale because of factors like final abbreviated key
cardinality.
It's easy to *contrive* a case where the "C" locale is beaten even
with 9.5 -- just sort a bunch of strings (that are abbreviated), that
look something like this:
"``..,,``..AAA"
"``..,,``..CCC"
"``..,,``..ZZZ"
"``..,,``..BBB"
Anyway, this avoidance of strxfrm() work I've introduced makes it
possible that abbreviated keys could make a strxfrm() locale-based
sort beat the "C" locale fair-and-square with a realistic dataset and
specific realistic query. That would be pretty nice, because that
can't be too far from optimal, and these cases are not uncommon.
A further idea -- unsigned integer comparisons
===================================
I've also changed text abbreviated keys to compare as unsigned
integers. On my Thinkpad laptop (which, of course, has an Intel CPU),
this makes a noticeable difference. memcmp() may be fast, but an
unsigned integer comparison is even faster (not sure if a big-endian
machine can have the existing memcmp() call optimized away, so that
effectively the same thing happens automatically).
Maybe other platforms benefit less, but it's very hard to imagine it
ever costing us anything.
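To illustrate the idea in isolation (a hedged sketch with hypothetical names, not the patch's actual code): if the leading bytes of each string are packed into a 64-bit integer most-significant-byte first, a single unsigned comparison orders the keys the same way memcmp() orders the original prefixes.

```c
#include <stdint.h>
#include <string.h>

/*
 * Pack up to 8 leading bytes of a string into a uint64_t, most
 * significant byte first, zero-padding short strings.  Comparing two
 * such integers as unsigned values gives the same ordering as a
 * memcmp() of the original prefixes.
 */
uint64_t
abbrev_key(const char *s)
{
    uint64_t    key = 0;
    size_t      len = strlen(s);
    size_t      n = len < 8 ? len : 8;

    for (size_t i = 0; i < n; i++)
        key |= (uint64_t) (unsigned char) s[i] << (56 - 8 * i);
    return key;
}

/* Comparator: one unsigned integer comparison instead of a memcmp(). */
int
abbrev_cmp(uint64_t a, uint64_t b)
{
    return (a > b) - (a < b);
}
```

In the patch itself the abbreviated key bytes come from strxfrm() (or the raw bytes in the "C" locale), but the comparison step is the same shape.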
[1]: /messages/by-id/5415A843.3070602@vmware.com
[2]: Commit e246b3d6
[3]: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/data/cities.dump
[4]: Commit a3f0b3d6 -- Peter Geoghegan
--
Peter Geoghegan
On Fri, Jul 3, 2015 at 8:33 PM, Peter Geoghegan <pg@heroku.com> wrote:
Since apparently we're back to development work, I thought it was time
to share a patch implementing a few additional simple tricks to make
sorting text under a non-C locale even faster than in 9.5. These
techniques are mostly effective when values are physically clustered
together. This might be because there is a physical/logical
correlation, but cases involving any kind of clustering of values are
helped significantly.
Interesting work.
Some comments:
1. My biggest gripe with this patch is that the comments are not easy
to understand. For example:
+ * Attempt to re-use buffers across calls. Also, avoid work in the event
+ * of clustered together identical items by exploiting temporal locality.
+ * This works well with divide-and-conquer, comparison-based sorts like
+ * quicksort and mergesort.
+ *
+ * With quicksort, there is, in general, a pretty strong chance that the
+ * same buffer contents can be used repeatedly for pivot items -- early
+ * pivot items will account for a large number of total comparisons, since
+ * they must be compared against many (possibly all other) items.
Well, what I would have written is something like: "We're likely to be
asked to compare the same strings repeatedly, and memcmp() is so much
cheaper than memcpy() that it pays to attempt a memcmp() in the hopes
of avoiding a memcpy(). This doesn't seem to slow things down
measurably even if it doesn't work out very often."
+ * While it is worth going to the trouble of trying to re-use buffer
+ * contents across calls, ideally that will lead to entirely avoiding a
+ * strcoll() call by using a cached return value.
+ *
+ * This optimization can work well again and again for the same set of
+ * clustered together identical attributes; as they're relocated to new
+ * subpartitions, only one strcoll() is required for each pivot (in respect
+ * of that clump of identical values, at least). Similarly, the final
+ * N-way merge of a mergesort can be effectively accelerated if each run
+ * has its own locally clustered values.
And here I would have written something like: "If we're comparing the
same two strings that we compared last time, we can return the same
answer without calling strcoll() again. This is more likely than it
seems, because quicksort compares the same pivot against many values,
and some of those values might be duplicates."
Of course everybody may prefer something different here; I'm just
telling you what I think.
2. I believe the change to bttextcmp_abbrev() should be pulled out
into a separate patch and committed separately. That part seems like
a slam dunk.
3. What is the worst case for the strxfrm()-reuse stuff? I suppose
it's the case where we have many strings that are long, all
equal-length, and all different, but only in the last few characters.
Then the memcmp() is as expensive as possible but never works out.
How does the patch hold up in that case?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Aug 4, 2015 at 12:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Interesting work.
Thanks.
1. My biggest gripe with this patch is that the comments are not easy
to understand.
Of course everybody may prefer something different here; I'm just
telling you what I think.
I have struggled with trying to put just the right amount of
exposition on the theory behind a particular optimization in source
code comments, and things like that. Since no one is telling me that I
need to write more, clearly I don't have the balance right yet. To a
certain extent, it is a matter of personal style, but I'll try and be
more terse.
2. I believe the change to bttextcmp_abbrev() should be pulled out
into a separate patch and committed separately. That part seems like
a slam dunk.
Makes sense.
3. What is the worst case for the strxfrm()-reuse stuff? I suppose
it's the case where we have many strings that are long, all
equal-length, and all different, but only in the last few characters.
Then the memcmp() is as expensive as possible but never works out.
How does the patch hold up in that case?
I haven't tested it. I'll get around to it at some point in the next
couple of weeks. I imagine that it's exactly the same as the memcmp()
equality thing because of factors like speculative execution, and the
fact that we need both strings in cache anyway. It's almost exactly
the same story, although unlike the memcmp() opportunistic equality
pre-check thing, this check happens only n times, not n log n times.
I'm quite sure that the cost needs to be virtually zero to go ahead
with the idea. I think it probably is. Note that like the memcmp()
thing, we check string length first, before a memcmp().
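That guard might be sketched like this (hypothetical function name; the real code works on PostgreSQL's text datums and cached buffers):

```c
#include <string.h>

/*
 * Opportunistically test whether the incoming value matches the cached
 * buffer.  The cheap length test screens out most misses before any
 * memcmp() is attempted, keeping the cost of a miss near zero.
 */
int
buffer_matches(const char *cached, size_t cached_len,
               const char *val, size_t val_len)
{
    return cached_len == val_len &&
           memcmp(cached, val, val_len) == 0;
}
```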
--
Peter Geoghegan
On Tue, Aug 4, 2015 at 1:30 PM, Peter Geoghegan <pg@heroku.com> wrote:
2. I believe the change to bttextcmp_abbrev() should be pulled out
into a separate patch and committed separately. That part seems like
a slam dunk.

Makes sense.
BTW, I want to put the string_uint() macro in a common header now. It
can be used for other types. I've written a SortSupport + abbreviated
keys patch for the UUID type which will use it, too, so that it too
can use simple unsigned integer comparisons within its abbreviated
comparator. I haven't posted the UUID patch yet only because everyone
is up to their ears in my sorting patches.
--
Peter Geoghegan
On Tue, Aug 4, 2015 at 12:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Some comments:
I attach a new version of the patch series that incorporates all your
feedback. The patch series is now made cumulative in a way that makes
it easy for someone to independently commit the unsigned integer
comparison optimization for text, and nothing else. The macro that it
uses is in a dedicated header now, because I have another patch
(SortSupport for the UUID type) that uses the same optimization for
the same reason. It seems like something that will probably end up
with a third or fourth client before too long, so I think the byte swap
macro wrapper belongs in sortsupport.h.
BTW, I think that in practice the merge phase of a tape sort isn't
much helped by comparison caching, contrary to comments appearing in
the original version. The heap data structure used by polyphase merge
has bad properties around locality (both temporal and spatial). I'm
thinking about independently addressing that problem. I now make no
claims about it in this patch.
--
Peter Geoghegan
Attachments:
0003-Add-two-text-sort-caching-optimizations.patch (text/x-patch, +71/-5)
0002-Use-unsigned-integer-abbreviated-keys-for-text.patch (text/x-patch, +11/-10)
0001-Add-BSWAP64-byte-swapping-macro.patch (text/x-patch, +123/-11)
On Sun, Oct 4, 2015 at 2:17 AM, Peter Geoghegan <pg@heroku.com> wrote:
On Tue, Aug 4, 2015 at 12:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Some comments:
I attach a new version of the patch series that incorporates all your
feedback. The patch series is now made cumulative in a way that makes
it easy for someone to independently commit the unsigned integer
comparison optimization for text, and nothing else. The macro that it
uses is in a dedicated header now, because I have another patch
(SortSupport for the UUID type) that uses the same optimization for
the same reason. It seems like something that will probably end up
with a third or fourth client before too long, so I think the byte swap
macro wrapper belongs in sortsupport.h.
Reviewing 0001, I'm happy to see us add bswap64, but I'm not sure we
should put it in c.h, because that's included by absolutely
everything. How about putting it in a new #include inside src/port,
like src/port/pg_bswap.h? Then pg_crc.h can include that, but other
things can, too.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Oct 6, 2015 at 1:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Reviewing 0001, I'm happy to see us add bswap64, but I'm not sure we
should put it in c.h, because that's included by absolutely
everything. How about putting it in a new #include inside src/port,
like src/port/pg_bswap.h? Then pg_crc.h can include that, but other
things can, too.
I guess I imagined that bswap64() was fundamental infrastructure, but
on second thought that's not actually in evidence -- it is not already
needed in plenty of other places. So yeah, works for me.
--
Peter Geoghegan
On Tue, Oct 6, 2015 at 4:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Tue, Oct 6, 2015 at 1:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Reviewing 0001, I'm happy to see us add bswap64, but I'm not sure we
should put it in c.h, because that's included by absolutely
everything. How about putting it in a new #include inside src/port,
like src/port/pg_bswap.h? Then pg_crc.h can include that, but other
things can, too.

I guess I imagined that bswap64() was fundamental infrastructure, but
on second thought that's not actually in evidence -- it is not already
needed in plenty of other places. So yeah, works for me.
If you would care to revise the patch accordingly, I will commit it
(barring objections from others, of course).
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Oct 6, 2015 at 1:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I guess I imagined that bswap64() was fundamental infrastructure, but
on second thought that's not actually in evidence -- it is not already
needed in plenty of other places. So yeah, works for me.

If you would care to revise the patch accordingly, I will commit it
(barring objections from others, of course).
Sure. It might take me a couple of days to get around to it, though --
things are a bit hectic here.
Thanks
--
Peter Geoghegan
On Tue, Oct 6, 2015 at 4:26 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Tue, Oct 6, 2015 at 1:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I guess I imagined that bswap64() was fundamental infrastructure, but
on second thought that's not actually in evidence -- it is not already
needed in plenty of other places. So yeah, works for me.

If you would care to revise the patch accordingly, I will commit it
(barring objections from others, of course).

Sure. It might take me a couple of days to get around to it, though --
things are a bit hectic here.
I know the feeling.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Oct 6, 2015 at 1:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
If you would care to revise the patch accordingly, I will commit it
(barring objections from others, of course).
Here is a revision of 0001-*, with both BSWAP32() and BSWAP64() in a
new header, src/port/pg_bswap.h.
No revisions were required to any other patch in the patch series to
make this work, and so I only include a revised 0001-*.
--
Peter Geoghegan
Attachments:
0001-Provide-for-unsigned-comparisons-of-abbreviated-keys.patch (text/x-patch, +150/-11)
On Wed, Oct 7, 2015 at 8:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Tue, Oct 6, 2015 at 1:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
If you would care to revise the patch accordingly, I will commit it
(barring objections from others, of course).

Here is a revision of 0001-*, with both BSWAP32() and BSWAP64() in a
new header, src/port/pg_bswap.h.

No revisions were required to any other patch in the patch series to
make this work, and so I only include a revised 0001-*.
Great. I've committed that, minus the sortsupport.h changes which I
think should be part of 0002, and which in any case I'd like to
discuss a bit more. It seems to me that (1) ABBREV_STRING_UINT isn't
a great name for this and (2) the comment is awfully long for the
thing to which it refers. I suggest that we instead call it
DatumToBigEndian(), put it pg_bswap.h, and change the comments to
something like this:
/*
 * Rearrange the bytes of a Datum into big-endian order.
 *
 * One possible application of this macro is to make comparisons cheaper.
 * An integer comparison of the new Datums will return the same result as
 * a memcmp() on the original Datums, but the integer comparison should be
 * much cheaper.
 */
The specific way that this is used by various sortsupport routines can
be adequately explained in the comments for those routines.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Oct 8, 2015 at 10:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
It seems to me that (1) ABBREV_STRING_UINT isn't
a great name for this and (2) the comment is awfully long for the
thing to which it refers. I suggest that we instead call it
DatumToBigEndian(), put it pg_bswap.h, and change the comments to
something like this:

/*
 * Rearrange the bytes of a Datum into big-endian order.
 *
 * One possible application of this macro is to make comparisons cheaper.
 * An integer comparison of the new Datums will return the same result as
 * a memcmp() on the original Datums, but the integer comparison should be
 * much cheaper.
 */

The specific way that this is used by various sortsupport routines can
be adequately explained in the comments for those routines.
This is pretty clearly something specific to SortSupport. I'm not
opposed to changing the name and making the comments more terse along
those lines, but I think it should live in sortsupport.h. The macro
byteswaps datums on little-endian platforms only, which seems very
specific.
I think that we're going to have SortSupport with abbreviation for
UUIDs and bytea at some point, and maybe character(n). Centralizing
information about this to sortsupport.h makes sense to me.
--
Peter Geoghegan
On Thu, Oct 8, 2015 at 2:07 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Thu, Oct 8, 2015 at 10:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
It seems to me that (1) ABBREV_STRING_UINT isn't
a great name for this and (2) the comment is awfully long for the
thing to which it refers. I suggest that we instead call it
DatumToBigEndian(), put it pg_bswap.h, and change the comments to
something like this:

/*
 * Rearrange the bytes of a Datum into big-endian order.
 *
 * One possible application of this macro is to make comparisons cheaper.
 * An integer comparison of the new Datums will return the same result as
 * a memcmp() on the original Datums, but the integer comparison should be
 * much cheaper.
 */

The specific way that this is used by various sortsupport routines can
be adequately explained in the comments for those routines.

This is pretty clearly something specific to SortSupport. I'm not
opposed to changing the name and making the comments more terse along
those lines, but I think it should live in sortsupport.h. The macro
byteswaps datums on little-endian platforms only, which seems very
specific.

I think that we're going to have SortSupport with abbreviation for
UUIDs and bytea at some point, and maybe character(n). Centralizing
information about this to sortsupport.h makes sense to me.
I'm not convinced. Doesn't this exact same concept get used for
over-the-wire communication between BE and LE machines? There, this
operation is spelled htonl/ntohl. Some systems even have htonll, but
I'm sure there are still a bunch that don't.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Oct 8, 2015 at 11:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
I'm not convinced. Doesn't this exact same concept get used for
over-the-wire communication between BE and LE machines? There, this
operation is spelled htonl/ntohl. Some systems even have htonll, but
I'm sure there are still a bunch that don't.
I continue to disagree with that. The spelling of the macro that you
propose suggests that this process occurs at a relatively high level
of abstraction, which is misleading. Datums that have abbreviated key
bytes packed into them are in general kind of special. All the same,
here is a revision of the patch series along those lines. I'll also
have to update the UUID patch to independently note the same issues.
I should point out that I did not call the macro DatumToBigEndian(),
because it's actually the other way around. I called it
DatumToLittleEndian(), since the unsigned integer comparator would
work correctly on big-endian systems without calling any new macro
(which is of course why the new macro does nothing on big-endian
systems). We start off with a big endian Datum/unsigned integer on all
platforms, and then we byteswap it to make it a little-endian unsigned
integer if and when that's required (i.e. only on little-endian
systems).
--
Peter Geoghegan
Attachments:
0002-Add-two-text-sort-caching-optimizations.patch (text/x-patch, +71/-5)
0001-Use-unsigned-integer-abbreviated-keys-for-text.patch (text/x-patch, +41/-10)
On Thu, Oct 8, 2015 at 8:20 PM, Peter Geoghegan <pg@heroku.com> wrote:
I should point out that I did not call the macro DatumToBigEndian(),
because it's actually the other way around. I called it
DatumToLittleEndian(), since the unsigned integer comparator would
work correctly on big-endian systems without calling any new macro
(which is of course why the new macro does nothing on big-endian
systems). We start off with a big endian Datum/unsigned integer on all
platforms, and then we byteswap it to make it a little-endian unsigned
integer if and when that's required (i.e. only on little-endian
systems).
Hmm. But then this doesn't seem to make much sense:
+ * Rearrange the bytes of a Datum into little-endian order from big-endian
+ * order. On big-endian machines, this does nothing at all.
Rearranging bytes into little-endian order ought to be a no-op on a
little-endian machine; and rearranging them into big-endian order
ought to be a no-op on a big-endian machine.
Thinking about this a bit more, it seems like the situation we're in
here is that the input datum is always going to be big-endian.
Regardless of what the machine's integer format is, the sortsupport
abbreviator is going to output a Datum where the most significant byte
is the first one stored in the datum. We want to convert that Datum
to one that has *native* endianness. So maybe we should call this
DatumBigEndianToNative or something like that.
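A minimal sketch of that conversion (assuming GCC/Clang's __builtin_bswap64 and a 64-bit Datum; the actual macro in PostgreSQL differs in detail):

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for the proposed macro: abbreviated keys are
 * stored most-significant-byte first, so on little-endian machines we
 * must byte-swap before comparing them as native unsigned integers.
 */
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#define DatumBigEndianToNative(x)  (x)
#else
#define DatumBigEndianToNative(x)  __builtin_bswap64(x)
#endif

/* Load 8 abbreviated-key bytes (stored big-endian) as a native uint64_t. */
uint64_t
load_abbrev_datum(const unsigned char *bytes)
{
    uint64_t    d;

    memcpy(&d, bytes, sizeof(d));
    return DatumBigEndianToNative(d);
}
```

After the conversion, an ordinary unsigned integer comparison of two such values orders them the same way memcmp() orders the stored key bytes, on either endianness.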
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Oct 9, 2015 at 11:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Hmm. But then this doesn't seem to make much sense:
+ * Rearrange the bytes of a Datum into little-endian order from big-endian
+ * order. On big-endian machines, this does nothing at all.

Rearranging bytes into little-endian order ought to be a no-op on a
little-endian machine; and rearranging them into big-endian order
ought to be a no-op on a big-endian machine.
I think that that's very clearly implied anyway.
Thinking about this a bit more, it seems like the situation we're in
here is that the input datum is always going to be big-endian.
Regardless of what the machine's integer format is, the sortsupport
abbreviator is going to output a Datum where the most significant byte
is the first one stored in the datum. We want to convert that Datum
to one that has *native* endianness. So maybe we should call this
DatumBigEndianToNative or something like that.
I'd be fine with DatumBigEndianToNative() -- I agree that that's
slightly better.
--
Peter Geoghegan
On Fri, Oct 9, 2015 at 2:48 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Fri, Oct 9, 2015 at 11:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Hmm. But then this doesn't seem to make much sense:
+ * Rearrange the bytes of a Datum into little-endian order from big-endian
+ * order. On big-endian machines, this does nothing at all.

Rearranging bytes into little-endian order ought to be a no-op on a
little-endian machine; and rearranging them into big-endian order
ought to be a no-op on a big-endian machine.

I think that that's very clearly implied anyway.
Thinking about this a bit more, it seems like the situation we're in
here is that the input datum is always going to be big-endian.
Regardless of what the machine's integer format is, the sortsupport
abbreviator is going to output a Datum where the most significant byte
is the first one stored in the datum. We want to convert that Datum
to one that has *native* endianness. So maybe we should call this
DatumBigEndianToNative or something like that.

I'd be fine with DatumBigEndianToNative() -- I agree that that's
slightly better.
OK, committed that way.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Oct 9, 2015 at 12:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
OK, committed that way.
Thank you.
--
Peter Geoghegan
On Fri, Oct 9, 2015 at 12:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
OK, committed that way.
Just for the record, with the same "cities" table as my original post
to this thread, this query:
select count(distinct(city)) from cities;
Goes from taking about 296ms (once it stabilizes), to about 265ms
(once it stabilizes) following today's commit of just the unsigned
integer comparison patch. I've shaved just over 10% off the duration
of this representative sort-heavy query (against a 9.5 baseline),
which is nice.
--
Peter Geoghegan