sortsupport for text

Started by Robert Haas, about 14 years ago, 75 messages, pgsql-hackers
#1 Robert Haas
robertmhaas@gmail.com

I decided to investigate the possible virtues of allowing "text" to
use the sortsupport infrastructure, since strings are something people
often want to sort. I generated 100,000 random alphanumeric strings,
each 30 characters in length, and loaded them into a single-column
table, froze it, ran SELECT SUM(1) FROM tablename on it until it was
fully cached, and then repeatedly quicksorted the table contents using
my default locale (en_US.UTF8). I repeated this test a number of
times, removing and recreating the data directory via initdb each
time. The test was performed on my home desktop, which is running
Fedora 14 (yeah, I know I should reinstall) and equipped with an AMD
Athlon 5000 Dual-Core Processor. Here's the exact test query:

SELECT SUM(1) FROM (SELECT * FROM randomtext ORDER BY t) x;

On unpatched master, this takes about 416 ms (plus or minus a few).
With the attached patch, it takes about 389 ms (plus or minus a very
few), a speedup of about 7%.

I repeated the experiment using the C locale, like this:

SELECT SUM(1) FROM (SELECT * FROM randomtext ORDER BY t COLLATE "C") x;

Here, it takes about 202 ms with the patch, and about 231 ms on
unpatched master, a savings of about 13%.
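The gap between the two locales comes down to what strcoll() has to do: in the "C" locale it is required to order byte-wise, i.e. to agree with strcmp(), which is why a C-collation sort can skip the locale machinery entirely. A standalone sketch of that property (illustrative C, not part of the patch):

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Sign of an int, so comparator results can be compared portably. */
static int sign(int x) { return (x > 0) - (x < 0); }

/*
 * In the "C" locale the collation order is plain byte order, so
 * strcoll() must agree with strcmp().  This is the property that lets
 * a C-collation sort bypass strcoll_l() and its setup costs.
 */
static int c_locale_agrees(const char *a, const char *b)
{
    setlocale(LC_COLLATE, "C");
    return sign(strcoll(a, b)) == sign(strcmp(a, b));
}
```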

I also tried on a larger data set of 5 million strings, with a heap
sort using work_mem=256MB. Unfortunately, there was so much noise
there that it was hard to get any useful measurements: the exact same
code base, using the exact same test script (that started with an
initdb) could perform quite differently on consecutive runs, perhaps
because the choice of blocks chosen to contain the database itself
affected the efficiency of reading and writing temporary files. I
think it may be faster, and the results on the smaller data set argue
that it should be faster, but I was unable to gather reproducible
numbers. I did observe the following oprofile results on a run on
this larger data set, on master:

12789  28.2686  libc-2.13.so  strcoll_l
6802   15.0350  postgres      text_cmp
5081   11.2310  postgres      comparetup_heap
3510    7.7584  postgres      comparison_shim
2892    6.3924  postgres      lc_collate_is_c
2722    6.0167  no-vmlinux    /no-vmlinux
2596    5.7382  postgres      varstr_cmp
2517    5.5635  libc-2.13.so  __strlen_sse2
2515    5.5591  libc-2.13.so  __memcpy_sse2
968     2.1397  postgres      tuplesort_heap_siftup
710     1.5694  postgres      bttextcmp
664     1.4677  postgres      pg_detoast_datum_packed

Clearly, a lot of that is unnecessary. Doing lc_collate_is_c for
every tuple is a complete waste, as is translating the collation OID
to a locale_t; this patch arranges to do those things just once per
sort. The comparison_shim is also a waste. Considering all that, I
had hoped for more like a 15-20% gain from this approach, but it
didn't happen, I suppose because some of the instructions saved just
resulted in more processor stalls. All the same, I'm inclined to
think it's still worth doing.

I didn't attempt to handle the weirdness that is UTF-8 on Windows,
since I don't develop on Windows. I thought when I wrote this code
that I could just leave the comparator uninitialized and let the
caller default to the shim implementation if the sort-support function
didn't do anything. But I see now that
PrepareSortSupportFromOrderingOp() feels that it's entitled to assume
that the sort-support function will always fill in a comparator.
Either that assumption needs to be changed, or the corresponding
Windows code needs to be written, or the sort support function needs
to call PrepareSortSupportComparisonShim() in this case.
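The fallback arrangement described above can be sketched with a simplified stand-in for the real API; SortSupportMock and the function names below are illustrative inventions, not PostgreSQL's actual definitions, which live in utils/sortsupport.h:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a sort-support state struct. */
typedef struct SortSupportMock
{
    int (*comparator)(int a, int b);   /* fast type-specific comparator */
} SortSupportMock;

/* Generic fallback, standing in for the fmgr "comparison shim". */
static int shim_comparator(int a, int b) { return (a > b) - (a < b); }

/* Fast path a type-specific sort-support function might install. */
static int fast_comparator(int a, int b) { return a - b; }

/*
 * Type-specific setup: may decline to install a comparator (as the
 * text patch might on Windows/UTF-8), leaving the field untouched.
 */
static void type_prepare(SortSupportMock *ssup, int supported)
{
    if (supported)
        ssup->comparator = fast_comparator;
}

/*
 * Caller-side preparation.  The open question in the mail is whether
 * this layer should tolerate a NULL comparator and install the shim
 * itself, as sketched here, or insist the type always fills it in.
 */
static void prepare_sort_support(SortSupportMock *ssup, int supported)
{
    ssup->comparator = NULL;
    type_prepare(ssup, supported);
    if (ssup->comparator == NULL)
        ssup->comparator = shim_comparator;  /* default to the shim */
}
```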

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

sortsupport-text-v1.patch (application/octet-stream, +203 -10)
#2 Noah Misch
noah@leadboat.com
In reply to: Robert Haas (#1)
Re: sortsupport for text

On Fri, Mar 02, 2012 at 03:45:38PM -0500, Robert Haas wrote:

SELECT SUM(1) FROM (SELECT * FROM randomtext ORDER BY t) x;

On unpatched master, this takes about 416 ms (plus or minus a few).
With the attached patch, it takes about 389 ms (plus or minus a very
few), a speedup of about 7%.

I repeated the experiment using the C locale, like this:

SELECT SUM(1) FROM (SELECT * FROM randomtext ORDER BY t COLLATE "C") x;

Here, it takes about 202 ms with the patch, and about 231 ms on
unpatched master, a savings of about 13%.

[oprofile report, further discussion]

Thanks for looking into this. Your patch is also a nice demonstration of
sortsupport's ability to help with more than just fmgr overhead.

Considering all that, I
had hoped for more like a 15-20% gain from this approach, but it
didn't happen, I suppose because some of the instructions saved just
resulted in more processor stalls. All the same, I'm inclined to
think it's still worth doing.

This is a border case, but I suggest that a 13% speedup on a narrowly-tailored
benchmark, degrading to 7% in common configurations, is too meager to justify
adopting this patch.

nm

#3 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Noah Misch (#2)
Re: sortsupport for text

Noah Misch <noah@leadboat.com> wrote:

On Fri, Mar 02, 2012 at 03:45:38PM -0500, Robert Haas wrote:

SELECT SUM(1) FROM (SELECT * FROM randomtext ORDER BY t) x;

[13% faster with patch for C collation; 7% faster for UTF8]

I had hoped for more like a 15-20% gain from this approach, but
it didn't happen, I suppose because some of the instructions
saved just resulted in more processor stalls. All the same, I'm
inclined to think it's still worth doing.

This is a border case, but I suggest that a 13% speedup on a
narrowly-tailored benchmark, degrading to 7% in common
configurations, is too meager to justify adopting this patch.

We use the C collation and have character strings in most indexes
and ORDER BY clauses. Unless there are significant
contraindications, I'm in favor of adopting this patch.

-Kevin

#4 Greg Stark
stark@mit.edu
In reply to: Robert Haas (#1)
Re: sortsupport for text

On Fri, Mar 2, 2012 at 8:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

12789    28.2686  libc-2.13.so             strcoll_l
6802     15.0350  postgres                 text_cmp

I'm still curious how it would compare to call strxfrm and sort the
resulting binary blobs. I don't think the sortsupport stuff actually
makes this any easier though. Since using it requires storing the
binary blob somewhere I think the support would have to be baked into
tuplesort (or hacked into the sortkey as an expr that was evaluated
earlier somehow).

It's a tradeoff and not an obvious one. The binary blobs are larger
and it would mean reading and copying more data around memory. But it
would mean doing the work that strcoll_l does only n times instead of
nlogn times. That might be a pretty significant gain.

--
greg

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#4)
Re: sortsupport for text

Greg Stark <stark@mit.edu> writes:

I'm still curious how it would compare to call strxfrm and sort the
resulting binary blobs.

In principle that should be a win; it's hard to believe that strxfrm
would have gotten into the standards if it were not a win for sorting
applications.

I don't think the sortsupport stuff actually
makes this any easier though. Since using it requires storing the
binary blob somewhere I think the support would have to be baked into
tuplesort (or hacked into the sortkey as an expr that was evaluated
earlier somehow).

Well, obviously something has to be done, but I think it might be
possible to express this as another sortsupport API function rather than
doing anything as ugly as hardwiring strxfrm into the callers.

However, it occurred to me that we could pretty easily jury-rig
something that would give us an idea about the actual benefit available
here. To wit: make a C function that wraps strxfrm, basically
strxfrm(text) returns bytea. Then compare the performance of
ORDER BY text_col to ORDER BY strxfrm(text_col).

(You would need to have either both or neither of text and bytea
using the sortsupport code paths for this to be a fair comparison.)
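The property Tom's jury-rig relies on is the defining guarantee of strxfrm(): comparing two transformed strings with strcmp()/memcmp() orders them the same way strcoll() orders the originals, so a bytea wrapper would let the existing binary comparator do collation-aware ordering. A minimal standalone check (illustrative C, with fixed-size buffers for brevity):

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

static int sign(int x) { return (x > 0) - (x < 0); }

/*
 * Verify the contract: strcmp() on strxfrm() output must give the
 * same ordering as strcoll() on the original strings, in whatever
 * the current LC_COLLATE locale is.
 */
static int xfrm_order_matches_strcoll(const char *a, const char *b)
{
    char ta[256], tb[256];

    /* Assume the transformed strings fit; real code must check the
     * return value against the buffer size (see strxfrm(3)). */
    strxfrm(ta, a, sizeof ta);
    strxfrm(tb, b, sizeof tb);
    return sign(strcmp(ta, tb)) == sign(strcoll(a, b));
}
```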

One other thing I've always wondered about in this connection is the
general performance of sorting toasted datums. Is it better to detoast
them in every comparison, or pre-detoast to save comparison cycles at
the cost of having to push much more data around? I didn't see any
discussion of this point in Robert's benchmarks, but I don't think we
should go very far towards enabling sortsupport for text until we
understand the issue and know whether we need to add more infrastructure
for it. If you cross your eyes a little bit, this is very much like
the strxfrm question...

regards, tom lane

#6 Robert Haas
robertmhaas@gmail.com
In reply to: Greg Stark (#4)
Re: sortsupport for text

On Sat, Mar 17, 2012 at 6:58 PM, Greg Stark <stark@mit.edu> wrote:

On Fri, Mar 2, 2012 at 8:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

12789    28.2686  libc-2.13.so             strcoll_l
6802     15.0350  postgres                 text_cmp

I'm still curious how it would compare to call strxfrm and sort the
resulting binary blobs. I don't think the sortsupport stuff actually
makes this any easier though. Since using it requires storing the
binary blob somewhere I think the support would have to be baked into
tuplesort (or hacked into the sortkey as an expr that was evaluated
earlier somehow).

Well, the real problem here is that the strxfrm'd representations
aren't just bigger - they are huge. On my MacBook Pro, if the input
representation is n characters, the strxfrm'd representation is 9n+3
characters. If we're sorting very wide tuples of which the sort
key is only a small part, maybe that would be acceptable, but if the
sort key makes up the bulk of the tuple then caching the strxfrm()
representation works out to slashing work_mem tenfold. That might be
just fine if the sort is going to fit in work_mem either way, but if
it turns a quicksort into a heap sort then I feel pretty confident
that it's going to be a loser. Keep in mind that even if the
strxfrm'd representation were no larger at all, it would still amount
to an additional copy of the data, so you'd still potentially be
eating up lots of work_mem that way. The fact that it's an order of
magnitude larger is just making a bad problem worse.
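The expansion factor is easy to probe on any given platform, since strxfrm() called with a size of 0 simply reports the space it would need without writing anything; a small helper (illustrative, results vary by libc and locale):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Bytes needed for the strxfrm'd form of s in the current LC_COLLATE
 * locale, excluding the terminating NUL.  Per C99, when the size
 * argument is 0 the destination may be NULL and strxfrm() just
 * returns the required length; dividing by strlen(s) gives the
 * platform's expansion factor (about 9n+3 on the Mac in question,
 * roughly 4n on the glibc Greg mentions below).
 */
static size_t xfrm_len(const char *s)
{
    return strxfrm(NULL, s, 0);
}
```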

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#5)
Re: sortsupport for text

On Sun, Mar 18, 2012 at 11:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

However, it occurred to me that we could pretty easily jury-rig
something that would give us an idea about the actual benefit available
here.  To wit: make a C function that wraps strxfrm, basically
strxfrm(text) returns bytea.  Then compare the performance of
ORDER BY text_col to ORDER BY strxfrm(text_col).

(You would need to have either both or neither of text and bytea
using the sortsupport code paths for this to be a fair comparison.)

Since the index will be ~9x bigger at least on this machine, I think I
know the answer, but I suppose it doesn't hurt to test it. It's not
that much work.

One other thing I've always wondered about in this connection is the
general performance of sorting toasted datums.  Is it better to detoast
them in every comparison, or pre-detoast to save comparison cycles at
the cost of having to push much more data around?  I didn't see any
discussion of this point in Robert's benchmarks, but I don't think we
should go very far towards enabling sortsupport for text until we
understand the issue and know whether we need to add more infrastructure
for it.  If you cross your eyes a little bit, this is very much like
the strxfrm question...

It would be surprising to me if there is a one-size-fits-all answer to
this question. For example, if the tuple got toasted because it's got
lots of columns and we had to take desperate measures to get it to fit
into an 8kB block at all, chances are that detoasting will work out
well. We'll use a bit more memory, but hopefully that'll be repaid by
much faster comparisons. OTOH, if you have a data set with many
relatively short strings and a few very long ones, detoasting up-front
could turn a quicksort into a heapsort. Since only a small fraction
of the comparisons would have involved one of the problematic long
strings anyway, it's unlikely to be worth the expense of keeping those
strings around in detoasted form for the entire sort (unless maybe
reconstructing them even a few times is problematic because we're
under heavy cache pressure and we get lots of disk seeks as a result).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8 Martijn van Oosterhout
kleptog@svana.org
In reply to: Robert Haas (#6)
Re: sortsupport for text

On Mon, Mar 19, 2012 at 12:19:53PM -0400, Robert Haas wrote:

On Sat, Mar 17, 2012 at 6:58 PM, Greg Stark <stark@mit.edu> wrote:

On Fri, Mar 2, 2012 at 8:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

12789    28.2686  libc-2.13.so             strcoll_l
6802     15.0350  postgres                 text_cmp

I'm still curious how it would compare to call strxfrm and sort the
resulting binary blobs. I don't think the sortsupport stuff actually
makes this any easier though. Since using it requires storing the
binary blob somewhere I think the support would have to be baked into
tuplesort (or hacked into the sortkey as an expr that was evaluated
earlier somehow).

Well, the real problem here is that the strxfrm'd representations
aren't just bigger - they are huge. On my MacBook Pro, if the input
representation is n characters, the strxfrm'd representation is 9n+3
characters.

Ouch. I was holding out hope that you could get a meaningful
improvement if we could use the first X bytes of the strxfrm output so
you only need to do a strcoll on strings that actually nearly match.
But with an information density of 9 bytes per character it
doesn't seem worthwhile.

That and this gem in the strxfrm manpage:

RETURN VALUE
The strxfrm() function returns the number of bytes required to
store the transformed string in dest excluding the terminating
'\0' character. If the value returned is n or more, the
contents of dest are indeterminate.

Which means that you have to take the entire transformed string, you
can't just ask for the first bit. I think that kind of leaves the whole
idea dead in the water.

Just for interest I looked at the ICU API for this and they have the
same restriction. There is another function which you can use to
return partial sort keys (ucol_nextSortKeyPart) but it produces
"uncompressed sortkeys", which it seems is what Mac OSX is doing, which
seems useless for our purposes. Either this is a hard problem or we're
nowhere near a target use case.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.

-- Arthur Schopenhauer

#9 Greg Stark
stark@mit.edu
In reply to: Martijn van Oosterhout (#8)
Re: sortsupport for text

On Mon, Mar 19, 2012 at 9:23 PM, Martijn van Oosterhout
<kleptog@svana.org> wrote:

Ouch. I was holding out hope that you could get a meaningful
improvement if we could use the first X bytes of the strxfrm output so
you only need to do a strcoll on strings that actually nearly match.
But with an information density of 9 bytes per character it
doesn't seem worthwhile.

When I was playing with glibc it was 4n. I think what they do is have
n bytes for the high order bits, then n bytes for low order bits like
capitalization or whitespace differences. I suspect they used to use
16 bits for each and have gone to some larger size.

That and this gem in the strxfrm manpage:

RETURN VALUE
      The  strxfrm()  function returns the number of bytes required to
      store the transformed string in dest excluding the terminating
      '\0' character.  If the value returned is n or more, the
      contents of dest are indeterminate.

Which means that you have to take the entire transformed string, you
can't just ask for the first bit. I think that kind of leaves the whole
idea dead in the water.

I believe the intended API is that you allocate a buffer with your
guess of the right size, call strxfrm and if it returns a larger
number you realloc your buffer and call it again.
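That guess-then-retry protocol might look like this in plain C (an illustrative sketch, not from any patch; xfrm_dup is an invented name):

```c
#include <stdlib.h>
#include <string.h>

/*
 * Transform src with strxfrm() into a freshly malloc'd buffer, using
 * the guess-then-retry protocol: if strxfrm() reports that more space
 * was needed than we offered, grow the buffer to the reported size
 * and call it again.  Returns NULL on allocation failure; the caller
 * frees the result.
 */
static char *xfrm_dup(const char *src, size_t guess)
{
    char *buf = malloc(guess);
    size_t needed;

    if (buf == NULL)
        return NULL;
    needed = strxfrm(buf, src, guess);
    if (needed >= guess)        /* didn't fit: contents indeterminate */
    {
        char *newbuf = realloc(buf, needed + 1);

        if (newbuf == NULL)
        {
            free(buf);
            return NULL;
        }
        buf = newbuf;
        (void) strxfrm(buf, src, needed + 1);   /* guaranteed to fit now */
    }
    return buf;
}
```

Note that the return value of the first, too-small call is still valid even though the buffer contents are not, which is what makes the single-retry version sufficient.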

--
greg

#10 Peter Geoghegan
peter@2ndquadrant.com
In reply to: Tom Lane (#5)
Re: sortsupport for text

On 18 March 2012 15:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:

One other thing I've always wondered about in this connection is the
general performance of sorting toasted datums.  Is it better to detoast
them in every comparison, or pre-detoast to save comparison cycles at
the cost of having to push much more data around?  I didn't see any
discussion of this point in Robert's benchmarks, but I don't think we
should go very far towards enabling sortsupport for text until we
understand the issue and know whether we need to add more infrastructure
for it.  If you cross your eyes a little bit, this is very much like
the strxfrm question...

I see the parallels. I note that glibc's strcoll_l() is implemented
entirely in C (strcoll() itself is implemented in terms of
strcoll_l()), whereas the various strcmp.S are written in hand-optimized
assembler, with SSE3 instructions in the "Highly optimized version for
x86-64", for example. I wonder just how important a factor that is. I
suppose the reason why the glibc guys haven't just done something
equivalent internally might be that they much prefer to perform the
comparison in-place, due to the need to target a conservative lowest
common denominator...or it could be because it just doesn't matter
that much.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

#11 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#10)
Re: sortsupport for text

On Thu, Jun 14, 2012 at 11:36 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:

On 18 March 2012 15:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:

One other thing I've always wondered about in this connection is the
general performance of sorting toasted datums.  Is it better to detoast
them in every comparison, or pre-detoast to save comparison cycles at
the cost of having to push much more data around?  I didn't see any
discussion of this point in Robert's benchmarks, but I don't think we
should go very far towards enabling sortsupport for text until we
understand the issue and know whether we need to add more infrastructure
for it.  If you cross your eyes a little bit, this is very much like
the strxfrm question...

I see the parallels.

The problem with pre-detoasting to save comparison cycles is that you
can now fit many, many fewer tuples in work_mem. There might be cases
where it wins (for example, because the entire data set fits even
after decompressing everything) but in most cases it seems like a
loser.

Also, my guess is that most values people sort by are pretty short,
making this concern mostly academic. Suppose you are sorting a bunch
of strings which might be either 100 characters in length or 1MB. If
they're all 100 characters, you probably don't need to detoast. If
they're all 1MB, you probably can't detoast without eating up a ton of
memory (and even if you have it, this might not be the best use for
it). If you have a mix, detoasting might be affordable provided that
the percentage of long strings is small, but it's also not going to
save you much, because if the percentage of long strings is small,
then most comparisons will be between two short strings where we don't
save anything anyway.

All things considered, this seems to me to be aiming at a pretty
narrow target, but maybe I'm just not thinking about it creatively
enough.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12 Peter Geoghegan
peter@2ndquadrant.com
In reply to: Robert Haas (#11)
Re: sortsupport for text

On 14 June 2012 17:35, Robert Haas <robertmhaas@gmail.com> wrote:

The problem with pre-detoasting to save comparison cycles is that you
can now fit many, many fewer tuples in work_mem.  There might be cases
where it wins (for example, because the entire data set fits even
after decompressing everything) but in most cases it seems like a
loser.

If I had to guess, I'd say you're probably right about that -
optimising sorting toasted text doesn't seem like a terribly sensible
use of your time.

What about the strxfrm suggestion of Greg's? You might find that the
added benefit of being able to avail of a highly optimised strcmp()
tipped the balance in favour of that idea, beyond the simple fact that
there's only a linear number of what you might loosely call "strcoll_l
units of work" rather than as many as O(n ^ 2). Furthermore, I'd
speculate that if you were to interlace the strxfrm() calls with
copying each text string, somewhere like within a specialised
datumCopy(), that would make the approach more efficient still, as you
specify a location for the blob in the just-palloc()'d leading-key
private memory directly, rather than just using memcpy.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

#13 Peter Geoghegan
peter@2ndquadrant.com
In reply to: Robert Haas (#1)
Re: sortsupport for text

On 2 March 2012 20:45, Robert Haas <robertmhaas@gmail.com> wrote:

I decided to investigate the possible virtues of allowing "text" to
use the sortsupport infrastructure, since strings are something people
often want to sort.

I should mention up-front that I agree with the idea that it is worth
optimising text sorting because it is a very common thing to have to
do, and therefore the standard for inclusion ought to be lower. I
don't intend to talk about tapesort though - that isn't really fair,
not least because I have some serious doubts about the quality of our
implementation. Furthermore, I think that it is logical that doing
things like resolving collations occur within a preparatory function
in advance of sorting, rather than redundantly doing that for each and
every comparison.

Why have you made the reusable buffer managed by sortsupport
TEXTBUFLEN-aligned? The existing rationale for that constant (whose
value is 1024) does not seem to carry forward here:

* This should be large enough that most strings will fit, but small
* enough that we feel comfortable putting it on the stack.

ISTM it would be on average worth the hit of having to repalloc a few
more times for larger strings by making that buffer much smaller
initially, and doubling its size each time that proved insufficient,
rather than increasing its size to the smallest possible
TEXTBUFLEN-aligned size that you can get away with for the immediately
subsequent memcpy. Realistically, any database I've ever worked with
would probably be able to fit a large majority of its text strings
into 16 chars of memory - you yourself said that sorting toasted text
isn't at all common.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

#14 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#12)
Re: sortsupport for text

On Thu, Jun 14, 2012 at 1:56 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:

On 14 June 2012 17:35, Robert Haas <robertmhaas@gmail.com> wrote:

The problem with pre-detoasting to save comparison cycles is that you
can now fit many, many fewer tuples in work_mem.  There might be cases
where it wins (for example, because the entire data set fits even
after decompressing everything) but in most cases it seems like a
loser.

If I had to guess, I'd say you're probably right about that -
optimising sorting toasted text doesn't seem like a terribly sensible
use of your time.

What about the strxfrm suggestion of Greg's? You might find that the
added benefit of being able to avail of a highly optimised strcmp()
tipped the balance in favour of that idea, beyond the simple fact that
there's only a linear number of what you might loosely call "strcoll_l
units of work" rather than as many as O(n ^ 2). Furthermore, I'd
speculate that if you were to interlace the strxfrm() calls with
copying each text string, somewhere like within a specialised
datumCopy(), that would make the approach more efficient still, as you
specify a location for the blob in the just-palloc()'d leading-key
private memory directly, rather than just using memcpy.

Well, it's still got the problem of blowing up memory usage. I just
can't get excited about optimizing for the case where we can consume
10x the memory and still fit in work_mem. If we've got that case, the
sort is gonna be pretty fast anyway. The case where preprocessing
wins is when there are going to be a large number of comparisons
against each tuple - i.e. lg(N) is large. But the cases where we
could pre-transform with strxfrm are those where the data fits in a
small percentage of work mem - i.e. lg(N) is small. I'm open to
somebody showing up with a test result that demonstrates that it's
worthwhile, but to me it seems like it's chasing diminishing returns.

The point of this patch isn't really to improve things for the
collation-aware case, although it's nice that it does. The point is
rather to shave off a double-digit percentage off the time it takes to
do the sort required to build a C-collation index, which is what
people should be using when they don't care about < and >, which most
don't. Despite Tom's concerns, I don't think there's anything in this
patch that can't be fairly easily revised at a later date if we decide
we want a different API. I think it's worth picking the low-hanging
fruit in the meantime.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#13)
Re: sortsupport for text

On Thu, Jun 14, 2012 at 2:10 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:

Why have you made the reusable buffer managed by sortsupport
TEXTBUFLEN-aligned? The existing rationale for that constant (whose
value is 1024) does not seem to carry forward here:

 * This should be large enough that most strings will fit, but small
 * enough that we feel comfortable putting it on the stack.

Well, as the comments explain:

+       /*
+        * We avoid repeated palloc/pfree on long strings by keeping the buffers
+        * we allocate around for the duration of the sort.  When we expand them,
+        * we round off to the next multiple of TEXTBUFLEN in order to avoid
+        * repeatedly expanding them by very small amounts.
+        */

ISTM it would be on average worth the hit of having to repalloc a few
more times for larger strings by making that buffer much smaller
initially, and doubling its size each time that proved insufficient,
rather than increasing its size to the smallest possible
TEXTBUFLEN-aligned size that you can get away with for the immediately
subsequent memcpy. Realistically, any database I've ever worked with
would probably be able to fit a large majority of its text strings
into 16 chars of memory - you yourself said that sorting toasted text
isn't at all common.

I thought that doubling repeatedly would be overly aggressive in terms
of memory usage. Blowing the buffers out to 8kB because we hit a
string that's a bit over 4kB isn't so bad, but blowing them out to
256MB because we hit a string that's a bit over 128MB seems a bit
excessive.
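The round-off policy being defended here can be sketched as follows (illustrative, using TEXTBUFLEN's documented value of 1024): a string a bit over 128MB grows the buffer to the next 1kB boundary above it, not to 256MB.

```c
#include <stddef.h>

#define TEXTBUFLEN 1024

/*
 * Round a requested buffer size up to the next multiple of TEXTBUFLEN,
 * i.e. grow by the smallest aligned step that accommodates the string,
 * rather than doubling.
 */
static size_t round_up_textbuflen(size_t needed)
{
    return ((needed + TEXTBUFLEN - 1) / TEXTBUFLEN) * TEXTBUFLEN;
}
```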

Also, I don't think it really saves anything from a performance point
of view. The worst case for this is if we're asked to repeatedly
compare strings that expand the buffer by a kilobyte each time.
First, how likely is that to happen in a real world test case? In
many cases, most of the input strings will be of approximately equal
length; also in many cases, that length will be short; even if the
lengths take the worst possible distribution, we have to hit them in
the worst possible order for this to be a problem. Second, even if it
does happen, does it really matter? Suppose we expand the buffer a
kilobyte at a time from an initial size of 1kB all the way out to
256MB. That's 256,000 palloc calls, so we must be sorting at least
256,000 datums, at least 128,000 of which are longer than 128MB. I
think the cost of calling memcpy() and strcoll() repeatedly on all
those long datums - not to mention repeatedly detoasting them - is
going to bludgeon the palloc overhead into complete insignificance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16 Peter Geoghegan
peter@2ndquadrant.com
In reply to: Robert Haas (#15)
Re: sortsupport for text

On 14 June 2012 19:28, Robert Haas <robertmhaas@gmail.com> wrote:

I thought that doubling repeatedly would be overly aggressive in terms
of memory usage.  Blowing the buffers out to 8kB because we hit a
string that's a bit over 4kB isn't so bad, but blowing them out to
256MB because we hit a string that's a bit over 128MB seems a bit
excessive.

That's pretty much what all popular dynamic array implementations do,
from C++'s std::vector to Python's list (it's a misnomer). Having to
allocate 256MB for a buffer to contain a string a bit over 128MB may
seem excessive, until you later get a 250MB string. Even if doubling
is generally excessive, which I doubt, that's beside the point, which
is that expanding the array by some constant proportion results in
each insertion taking amortized constant time.
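The amortization argument in miniature: counting how many expansions each policy needs to grow a 1kB buffer out to 256MB (an illustrative sketch; the trade-off Robert raises is that each doubling can overshoot the needed size by nearly 2x).

```c
#include <stddef.h>

/*
 * Expansions needed to reach target capacity, doubling from start each
 * time - the std::vector-style geometric policy, which makes the total
 * bytes copied across all expansions linear in the final size.
 */
static int grows_doubling(size_t start, size_t target)
{
    int n = 0;
    size_t cap = start;

    while (cap < target) { cap *= 2; n++; }
    return n;
}

/*
 * Same, growing by a fixed step - the round-to-TEXTBUFLEN policy -
 * which in the worst case makes total copying quadratic.
 */
static int grows_fixed(size_t start, size_t step, size_t target)
{
    int n = 0;
    size_t cap = start;

    while (cap < target) { cap += step; n++; }
    return n;
}
```

From 1kB to 256MB, doubling needs 18 expansions where fixed 1kB steps need 262,143; the counter-argument in this thread is that the surrounding memcpy/strcoll/detoast work dwarfs either number.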

 Suppose we expand the buffer a
kilobyte at a time from an initial size of 1kB all the way out to
256MB.  That's 256,000 palloc calls, so we must be sorting at least
256,000 datums, at least 128,000 of which are longer than 128MB.  I
think the cost of calling memcpy() and strcoll() repeatedly on all
those long datums - not to mention repeatedly detoasting them - is
going to bludgeon the palloc overhead into complete insignificance.

I fail to understand how this sortsupport buffer fundamentally differs
from a generic dynamic array abstraction built to contain chars. That
being the case, I see no reason not to just do what everyone else does
when expanding dynamic arrays, and no reason why we shouldn't make
essentially the same time-space trade-off here as others do elsewhere.

Another concern is that it seems fairly pointless to have two buffers.
Wouldn't it be more sensible to have a single buffer that was
partitioned to make two logical, equally-sized buffers, given that in
general each buffer is expected to grow at exactly the same rate?

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

#17 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#16)
Re: sortsupport for text

On Thu, Jun 14, 2012 at 3:24 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:

On 14 June 2012 19:28, Robert Haas <robertmhaas@gmail.com> wrote:

I thought that doubling repeatedly would be overly aggressive in terms
of memory usage.  Blowing the buffers out to 8kB because we hit a
string that's a bit over 4kB isn't so bad, but blowing them out to
256MB because we hit a string that's a bit over 128MB seems a bit
excessive.

That's pretty much what all popular dynamic array implementations do,
from C++'s std::vector to Python's list (it's a misnomer). Having to
allocate 256MB for a buffer to contain a string a bit over 128MB may
seem excessive, until you later get a 250MB string. Even if doubling
is generally excessive, which I doubt, that's beside the point, which
is that expanding the array by some constant proportion results in
each insertion taking amortized constant time.

Yeah, but *it doesn't matter*. If you test this on strings that are
long enough that they get pushed out to TOAST, you'll find that it
doesn't measurably improve performance, because the overhead of
detoasting so completely dominates any savings on the palloc side that
you can't pick them out of the inter-run noise. Risking eating up an
extra 100MB of memory that we don't really need in order to obtain a
performance optimization that is far too small to measure does not
make sense. The case with std::vector is not analogous; they don't
have any way of knowing what other overhead you are incurring between
insertions into the vector, so it's reasonable to suppose that the
cost of the vector insertions themselves might be material. Here we
know that it doesn't matter, so the application of Knuth's first law
of optimization is appropriate.

 Suppose we expand the buffer a
kilobyte at a time from an initial size of 1kB all the way out to
256MB.  That's 256,000 palloc calls, so we must be sorting at least
256,000 datums, at least 128,000 of which are longer than 128MB.  I
think the cost of calling memcpy() and strcoll() repeatedly on all
those long datums - not to mention repeatedly detoasting them - is
going to bludgeon the palloc overhead into complete insignificance.

I fail to understand how this sortsupport buffer fundamentally differs
from a generic dynamic array abstraction built to contain chars. That
being the case, I see no reason not to just do what everyone else does
when expanding dynamic arrays, and no reason why we shouldn't make
essentially the same time-space trade-off here as others do elsewhere.

Another concern is that it seems fairly pointless to have two buffers.
Wouldn't it be more sensible to have a single buffer that was
partitioned to make two logical, equally-sized buffers, given that in
general each buffer is expected to grow at exactly the same rate?

Sure, but it would be making the code more complicated in return for
no measurable performance benefit. We generally avoid that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Peter Geoghegan
peter@2ndquadrant.com
In reply to: Robert Haas (#17)
Re: sortsupport for text

On 14 June 2012 20:32, Robert Haas <robertmhaas@gmail.com> wrote:

Yeah, but *it doesn't matter*.  If you test this on strings that are
long enough that they get pushed out to TOAST, you'll find that it
doesn't measurably improve performance, because the overhead of
detoasting so completely dominates any savings on the palloc side that
you can't pick them out of the inter-run noise.

That's probably true, but it's also beside the point. As recently as a
few hours ago, you yourself said "my guess is that most values people
sort by are pretty short, making this concern mostly academic". Why
are you getting hung up on toasting now?

Here we know that it doesn't matter, so the application of Knuth's first law
of optimization is appropriate.

I'm not advocating some Byzantine optimisation, or even something that
could reasonably be described as an optimisation at all here. I'm
questioning why you've unnecessarily complicated the code by having
the buffer size just big enough to fit the biggest value seen so far,
but arbitrarily aligned to a value that is completely irrelevant to
bttextfastcmp_locale(), rather than using simple geometric expansion,
which is more or less the standard way of managing the growth of a
dynamic array.

You have to grow the array in some way. The basic approach I've
outlined has something to recommend it - why does it make sense to
align the size of the buffer to TEXTBUFLEN in particular though? It's
quite easy to imagine what you've done here resulting in an excessive
number of allocations (and pfree()s), which *could* be expensive. If
you're so conservative about allocating memory, don't grow the array
at quite so aggressive a rate as doubling it each time.

There is a trade-off between space and time to be made here, but I
don't know why you think that the right choice is to use almost the
smallest possible amount of memory in all cases.

Another concern is that it seems fairly pointless to have two buffers.
Wouldn't it be more sensible to have a single buffer that was
partitioned to make two logical, equally-sized buffers, given that in
general each buffer is expected to grow at exactly the same rate?

Sure, but it would be making the code more complicated in return for
no measurable performance benefit.  We generally avoid that.

Fair enough.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

#19Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#18)
Re: sortsupport for text

On Thu, Jun 14, 2012 at 6:30 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:

Here we know that it doesn't matter, so the application of Knuth's first law
of optimization is appropriate.

I'm not advocating some Byzantine optimisation, or even something that
could reasonably be described as an optimisation at all here. I'm
questioning why you've unnecessarily complicated the code by having
the buffer size just big enough to fit the biggest value seen so far,
but arbitrarily aligned to a value that is completely irrelevant to
bttextfastcmp_locale(), rather than using simple geometric expansion,
which is more or less the standard way of managing the growth of a
dynamic array.

Well, so... palloc, and most malloc implementations, don't actually
just double every time forever. They double up to some relatively
small number like 1MB, and then they expand in 1MB chunks.
std::vector may well behave as you're describing, but that's doing
something completely different. We're not accumulating a string of
unknown length into a buffer, with the risk of having to copy the
whole buffer every time we reallocate. We know the exact size of the
buffer we need for any given string, so the cost of a pfree/palloc is
only the cost of releasing and allocating memory; there is no
additional copying cost, as there would be for std::vector.

Look at it another way. Right now, the worst case number of temporary
buffer allocations is about 2N lg N, assuming we do about N lg N
comparisons during a sort. If we simply implemented the most naive
buffer reallocation algorithm possible, we might in the worst case
reallocate each buffer up to N times. That is already better than
what we do now by a factor of lg N. If we suppose N is about a
million, that's an improvement of 20x *in the worst case* - typically,
the improvement will be much larger, because all the strings will be
of similar length or we won't hit them in exactly increasing order of
length. The algorithm I've actually implemented bounds the worst case
1024 times more tightly - given N strings, we can't need to enlarge
the buffer more than N/1024 times. So with N of around a million,
this algorithm should eliminate *at least* 99.995% of the pallocs we
currently do. How much better does it need to be?

You have to grow the array in some way. The basic approach I've
outlined has something to recommend it - why does it make sense to
align the size of the buffer to TEXTBUFLEN in particular though?

There's nothing magic about TEXTBUFLEN as far as that goes. We could
round to the nearest multiple of 8kB or whatever if that seemed
better. But if we rounded to, say, the nearest 1MB, then someone
sorting strings that are 2kB long would use 2MB of memory for these
buffers. Considering that we ship with work_mem = 1MB, that could
easily end up wasting almost twice work_mem. I don't see how you can
view that as a trivial problem. It might be worth sucking that up if
it seemed likely to dramatically improve performance, but it doesn't.

It's
quite easy to imagine what you've done here resulting in an excessive
number of allocations (and pfree()s), which *could* be expensive.

Actually, I can't imagine that at all, per the above analysis. If I
could, I might agree with you. :-)

There is a trade-off between space and time to be made here, but I
don't know why you think that the right choice is to use almost the
smallest possible amount of memory in all cases.

Because I think that with the current implementation I can have my
cake and eat it, too. I believe I've gotten essentially all the
available speedup for essentially no memory wastage. If you can
convince me that I've missed something and there is still a meaningful
amount of palloc overhead left to be squeezed out, I'm all ears.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#16)
Re: sortsupport for text

Peter Geoghegan <peter@2ndquadrant.com> writes:

On 14 June 2012 19:28, Robert Haas <robertmhaas@gmail.com> wrote:

I thought that doubling repeatedly would be overly aggressive in terms
of memory usage.

I fail to understand how this sortsupport buffer fundamentally differs
from a generic dynamic array abstraction built to contain chars. That
being the case, I see no reason not to just do what everyone else does
when expanding dynamic arrays, and no reason why we shouldn't make
essentially the same time-space trade-off here as others do elsewhere.

I agree with Peter on this one; not only is double-each-time the most
widespread plan, but it is what we do in just about every other place
in Postgres that needs a dynamically expansible buffer. If you do it
randomly differently here, readers of the code will be constantly
stopping to wonder why it's different here and if that's a bug or not.
(And from a performance standpoint, I'm not entirely convinced it's not
a bug, anyway. Worst-case behavior could be pretty bad.)

regards, tom lane
