Memory prefetching while sequentially fetching from SortTuple array, tuplestore

Started by Peter Geoghegan, over 10 years ago · 24 messages · pgsql-hackers

I've always thought that GCC intrinsics for software-based memory
prefetching are a very sharp tool. While it's really hard to use GCC's
__builtin_prefetch() effectively (I've tried enough times to know), I
always thought that there'd probably end up being a handful of cases
where using it presented a clear and worthwhile win. Hash joins seemed
to me to be a likely candidate, because memory latency is a real
killer there. However, I stumbled upon another case where prefetching
pays clear dividends while actually being quite simple: sequentially
reading an already-sorted memtuples array (an array of SortTuple
structs), during some kind of post-processing.

This comes up for many common and important cases, including regular
sort nodes (heaptuple sorts), datumsorts with pass-by-reference types
(e.g. ordered set aggregates), and B-Tree index builds, where the
sequential fetching occurs as actual B-Tree leaf pages are built from
the array.

Patch
=====

Attached patch adds a portable Postgres wrapper on the GCC intrinsic.
It also adds a client within tuplesort.c -- a common function that
seems like a good generic choke point. I can get a speed-up of 6% - 9%
for all of these cases by prefetching a few SortTuples ahead, for the
"tuple proper". (In-memory sorts only.)

ISTM that this alone makes the patch worthwhile, but we should still
consider the break-down of costs in each of these operations, and that
the cost of the sequential reading and processing of SortTuples is all
that has been touched by the patch. There are only about 3 real lines
of code added to tuplesort. Obviously, the dominant costs within each
of these queries/utility commands are the sort proper, and to a lesser
extent the heap scan preceding it. I am not targeting those at all --
I'm only targeting the tertiary cost of sequentially scanning the
memtuples array by hiding memory latency, and yet I end up with a
notable overall speed-up.

It's only really fair to look at this final sequential fetching cost
in isolation. The reduction in that cost in isolation seems to be
huge. I tried out the proprietary Intel vTune Amplifier utility with
this case, and it indicated that CPU time spent in calls to
_bt_buildadd was less than 1/8 the previous time in a comparable test
of the master branch, with no other changes that I can attribute to
this patch (i.e. there is a bit of noise). You can easily get some
hint of the size of the improvement by reviewing "trace_sort = on" log
output, and the reduction of time spent after "performsort" is done
but before the sort is shut down.

Tuplestores
----------------

tuplestore.c has a similar optimization added, on the theory that it
should match tuplesort.c wherever possible, and because it seems to
help a lot there too. For example, queries that perform a big CTE Scan
seem to benefit to about the same degree (although, again, only when
the array is fully in memory).

Portability concerns
===============

I started writing this patch with an intuition that this would help. I
arrived at the fixed-up version posted here through frobbing the code
based on the feedback of Postgres running on my laptop. I iteratively
tweaked the number of tuples ahead to fetch from, and the
__builtin_prefetch() temporal locality argument's value, which helped
a little for a number of test cases. If that doesn't inspire much
confidence in this patch, then good -- that's the idea. Obviously if
we are going to commit this, there needs to be testing of a reasonably
representative selection of platforms.

On the other hand, I couldn't find a case that was regressed, and I am
encouraged by the fact that this helps the post-processing of
SortTuples so effectively. A platform would have to have very
different performance characteristics for this to be the wrong thing.
However, it seems possible that there is a case that has
been regressed, since the prefetch is added at a very generic point
within tuplesort.c and tuplestore.c.
--
Peter Geoghegan

Attachments:

0001-Prefetch-from-memtuples-array-in-tuplesort.patch (text/x-patch, +101/-1)
In reply to: Peter Geoghegan (#1)
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

On Thu, Jul 16, 2015 at 4:01 PM, Peter Geoghegan <pg@heroku.com> wrote:

Attached patch adds a portable Postgres wrapper on the GCC intrinsic.
It also adds a client within tuplesort.c -- a common function that
seems like a good generic choke point. I can get a speed-up of 6% - 9%
for all of these cases by prefetching a few SortTuples ahead, for the
"tuple proper". (In-memory sorts only.)

I added a silly bug during last minute clean-up. I attach a V2.

--
Peter Geoghegan

Attachments:

v2-prefetch-from-memtuples-array-in-tuplesort.patch (text/x-patch, +102/-1)
#3 Tom Lane <tgl@sss.pgh.pa.us>
In reply to: Peter Geoghegan (#1)
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

Peter Geoghegan <pg@heroku.com> writes:

Attached patch adds a portable Postgres wrapper on the GCC intrinsic.

Meh. I don't like the assumption that non-GCC compilers will be smart
enough to optimize away the useless-to-them if() tests this adds.
Please refactor that so that there is exactly 0 new code when the
intrinsic doesn't exist.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Tom Lane (#3)
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

On Thu, Jul 16, 2015 at 8:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Meh. I don't like the assumption that non-GCC compilers will be smart
enough to optimize away the useless-to-them if() tests this adds.
Please refactor that so that there is exactly 0 new code when the
intrinsic doesn't exist.

I imagined that there was some value in copying the GCC intrinsic's
behavior, and actually evaluating the "addr" expression even in the
event of no platform support. On reflection, I suppose that that isn't
actually a particularly useful property for Postgres. There will only
ever be a handful of callers.

Attached revision does not rely on such optimization occurring on
platforms that lack __builtin_prefetch(). This allowed me to decouple
availability from actual use, in the style of posix_fadvise(), so that
one can manually disable memory prefetching within pg_config_manual.h.
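Under those assumptions (the macro names follow the patch, but the manual-disable knob's name and the empty fallback are a sketch rather than what v3 actually does), the decoupling might look like:

```c
/*
 * Availability (HAVE__BUILTIN_PREFETCH, detected by configure) is decoupled
 * from actual use, in the style of posix_fadvise(): defining a hypothetical
 * DISABLE_MEM_PREFETCH in pg_config_manual.h turns prefetching off even
 * where the intrinsic exists.
 */
#if defined(HAVE__BUILTIN_PREFETCH) && !defined(DISABLE_MEM_PREFETCH)
#define USE_MEM_PREFETCH
#endif

#ifdef USE_MEM_PREFETCH
#define pg_rfetch(addr)		__builtin_prefetch((addr), 0, 0)
#else
#define pg_rfetch(addr)		((void) 0)	/* no-op fallback (sketch) */
#endif
```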

Clang is compatible with the __builtin_prefetch() intrinsic, FWIW. I'm
not sure if it's worth trying to make the wrapper portable across a
variety of supported compilers. If we were to attempt it, we would not
be the first. I note that ICC's memref_control has an identical
interface to __builtin_prefetch().

--
Peter Geoghegan

Attachments:

v3-prefetch-from-memtuples-array-in-tuplesort.patch (text/x-patch, +111/-1)
#5 Andres Freund <andres@anarazel.de>
In reply to: Peter Geoghegan (#4)
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

On 2015-07-19 16:34:52 -0700, Peter Geoghegan wrote:

+# PGAC_C_BUILTIN_PREFETCH
+# -------------------------
+# Check if the C compiler understands __builtin_prefetch(),
+# and define HAVE__BUILTIN_PREFETCH if so.
+AC_DEFUN([PGAC_C_BUILTIN_PREFETCH],
+[AC_CACHE_CHECK(for __builtin_prefetch, pgac_cv__builtin_prefetch,
+[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([],
+[int i = 0;__builtin_prefetch(&i, 0, 3);])],
+[pgac_cv__builtin_prefetch=yes],
+[pgac_cv__builtin_prefetch=no])])
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_PREFETCH, 1,
+          [Define to 1 if your compiler understands __builtin_prefetch.])
+fi])# PGAC_C_BUILTIN_PREFETCH

Hm. Does a compiler test actually test anything reliably here? Won't this
just throw a warning at compile time about an unknown function?

+/*
+ * Prefetch support -- Support memory prefetching hints on some platforms.
+ *
+ * pg_rfetch() is specialized for the case where an array is accessed
+ * sequentially, and we can prefetch a pointer within the next element (or an
+ * even later element) in order to hide memory latency.  This case involves
+ * prefetching addresses with low temporal locality.  Note that it's rather
+ * difficult to get any kind of speedup using pg_rfetch();  any use of the
+ * intrinsic should be carefully tested.  Also note that it's okay to pass it
+ * an invalid or NULL address, although it's best avoided.
+ */
+#if defined(USE_MEM_PREFETCH)
+#define pg_rfetch(addr)		__builtin_prefetch((addr), 0, 0)
+#endif

What is that name actually supposed to mean? 'rfetch' doesn't seem to be
particularly clear - or is it meant to be wordplay combined with the
p?

I don't think you should specify the read/write and locality parameters
if we don't hand-pick them - right now you're saying the memory will
only be read and that it has no temporal locality.

I think there should be an empty fallback definition even if the current
only caller ends up not needing it - not all callers will require it.

+					/*
+					 * Perform memory prefetch of tuple that's three places
+					 * ahead of current (which is returned to caller).
+					 * Testing shows that this significantly boosts the
+					 * performance for TSS_INMEM "forward" callers by
+					 * hiding memory latency behind their processing of
+					 * returned tuples.
+					 */
+#ifdef USE_MEM_PREFETCH
+					if (readptr->current + 3 < state->memtupcount)
+						pg_rfetch(state->memtuples[readptr->current + 3]);
+#endif

Hm. Why do we need a branch here? The targets of prefetches don't need to
be valid addresses and adding a bunch of arithmetic and a branch for the
corner case doesn't sound worthwhile to me.

What worries me about adding explicit prefetching is that it's *highly*
platform and even micro-architecture dependent. Why is looking three
elements ahead the right value?

Greetings,

Andres Freund


In reply to: Andres Freund (#5)
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund <andres@anarazel.de> wrote:

On 2015-07-19 16:34:52 -0700, Peter Geoghegan wrote:
Hm. Does a compiler test actually test anything reliably here? Won't this
just throw a warning at compile time about an unknown function?

I'll need to look into that.

+/*
+ * Prefetch support -- Support memory prefetching hints on some platforms.
+ *
+ * pg_rfetch() is specialized for the case where an array is accessed
+ * sequentially, and we can prefetch a pointer within the next element (or an
+ * even later element) in order to hide memory latency.  This case involves
+ * prefetching addresses with low temporal locality.  Note that it's rather
+ * difficult to get any kind of speedup using pg_rfetch();  any use of the
+ * intrinsic should be carefully tested.  Also note that it's okay to pass it
+ * an invalid or NULL address, although it's best avoided.
+ */
+#if defined(USE_MEM_PREFETCH)
+#define pg_rfetch(addr)              __builtin_prefetch((addr), 0, 0)
+#endif

What is that name actually supposed to mean? 'rfetch' doesn't seem to be
particularly clear - or is it meant to be wordplay combined with the
p?

"Read fetch". One argument past to the intrinsic here specifies that
the variable will be read only. I did things this way because I
imagined that there would be very limited uses for the macro only. I
probably cover almost all interesting cases for explicit memory
prefetching already.

I don't think you should specify the read/write and locality parameters
if we don't hand-pick them - right now you're saying the memory will
only be read and that it has no temporal locality.

I think there should be an empty fallback definition even if the current
only caller ends up not needing it - not all callers will require it.

It started out that way, but Tom felt that it was better to have a
USE_MEM_PREFETCH because of the branch below...

+                                     /*
+                                      * Perform memory prefetch of tuple that's three places
+                                      * ahead of current (which is returned to caller).
+                                      * Testing shows that this significantly boosts the
+                                      * performance for TSS_INMEM "forward" callers by
+                                      * hiding memory latency behind their processing of
+                                      * returned tuples.
+                                      */
+#ifdef USE_MEM_PREFETCH
+                                     if (readptr->current + 3 < state->memtupcount)
+                                             pg_rfetch(state->memtuples[readptr->current + 3]);
+#endif

Hm. Why do we need a branch here? The targets of prefetches don't need to
be valid addresses and adding a bunch of arithmetic and a branch for the
corner case doesn't sound worthwhile to me.

...which is needed, because I'm still required to not dereference wild
pointers. In other words, although pg_rfetch()/__builtin_prefetch()
does not require that a valid pointer be passed, it is not okay to
read past an array's bounds to read that pointer. The GCC docs are
clear on this -- "Data prefetch does not generate faults if addr is
invalid, but the address expression itself must be valid". Also, for
cases that don't benefit (like datum sorts, which never have a "tuple
proper" -- the above code is for tuplestore, not tuplesort, so you
can't see this), it seemed faster to have the branch than to rely on
__builtin_prefetch() being forgiving about invalid/wild pointers. I
saw a regression without the branch for that case.

What worries me about adding explicit prefetching is that it's *highly*
platform and even micro-architecture dependent. Why is looking three
elements ahead the right value?

Because that was the fastest value following testing on my laptop. You
are absolutely right to point out that this isn't a good reason to go
with the patch -- I share your concern. All I can say in defense of
that is that other major system software does the same, without any
regard for the underlying microarchitecture AFAICT. So, yes,
certainly, more testing is required across a reasonable cross-section
of platforms to justify the patch. But right now, time spent in
_bt_buildadd() is massively reduced with the patch, which makes it
worthwhile in my mind. It's really very effective if you look at the
only part of (say) a CREATE INDEX that is affected one way or another
-- the final fetching of tuples.

BTW, I am very close to posting a patch that makes (say) CREATE INDEX
very fast for external sorts. That uses a memory pool to make final
on-the-fly merge memory access sequential per tape. That's what I'll
do rather than explicitly prefetching for external sorts. It makes a
huge difference on top of the external sort work already posted,
because the final merge step was a big remaining bottleneck which I've
now fixed. An unlogged CREATE INDEX with 6 runs to merge is often only
15% slower than an equivalent CREATE INDEX that is all internal
(against the master branch).

--
Peter Geoghegan


#7 Andres Freund <andres@anarazel.de>
In reply to: Peter Geoghegan (#6)
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

On 2015-09-02 12:24:33 -0700, Peter Geoghegan wrote:

"Read fetch". One argument past to the intrinsic here specifies that
the variable will be read only. I did things this way because I
imagined that there would be very limited uses for the macro only. I
probably cover almost all interesting cases for explicit memory
prefetching already.

I'd be less brief in this case then, no need to be super short here.

I don't think you should specify the read/write and locality parameters
if we don't hand-pick them - right now you're saying the memory will
only be read and that it has no temporal locality.

I think there should be an empty fallback definition even if the current
only caller ends up not needing it - not all callers will require it.

It started out that way, but Tom felt that it was better to have a
USE_MEM_PREFETCH because of the branch below...

That doesn't mean we shouldn't still provide an empty definition.

Hm. Why do we need a branch here? The targets of prefetches don't need to
be valid addresses and adding a bunch of arithmetic and a branch for the
corner case doesn't sound worthwhile to me.

...which is needed, because I'm still required to not dereference wild
pointers. In other words, although pg_rfetch()/__builtin_prefetch()
does not require that a valid pointer be passed, it is not okay to
read past an array's bounds to read that pointer. The GCC docs are
clear on this -- "Data prefetch does not generate faults if addr is
invalid, but the address expression itself must be valid".

That's just a question of how to formulate this though?

pg_rfetch(((char *) state->memtuples ) + 3 * sizeof(SortTuple) + offsetof(SortTuple, tuple))?

For something heavily platform dependent like this that seems ok.

Because that was the fastest value following testing on my laptop. You
are absolutely right to point out that this isn't a good reason to go
with the patch -- I share your concern. All I can say in defense of
that is that other major system software does the same, without any
regard for the underlying microarchitecture AFAICT.

I know linux stripped out most prefetches at some point, even from x86
specific code, because it showed that they aged very badly. I.e. they
removed a bunch of them and stuff got faster, whereas they were
beneficial on earlier architectures.

Greetings,

Andres Freund


In reply to: Andres Freund (#7)
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

On Wed, Sep 2, 2015 at 3:13 PM, Andres Freund <andres@anarazel.de> wrote:

I'd be less brief in this case then, no need to be super short here.

Okay.

It started out that way, but Tom felt that it was better to have a
USE_MEM_PREFETCH because of the branch below...

That doesn't mean we shouldn't still provide an empty definition.

Okay.

That's just a question of how to formulate this though?

pg_rfetch(((char *) state->memtuples ) + 3 * sizeof(SortTuple) + offsetof(SortTuple, tuple))?

For something heavily platform dependent like this that seems ok.

Well, still needs to work for tuplestore, which does not have a SortTuple.

Because that was the fastest value following testing on my laptop. You
are absolutely right to point out that this isn't a good reason to go
with the patch -- I share your concern. All I can say in defense of
that is that other major system software does the same, without any
regard for the underlying microarchitecture AFAICT.

I know linux stripped out most prefetches at some point, even from x86
specific code, because it showed that they aged very badly. I.e. they
removed a bunch of them and stuff got faster, whereas they were
beneficial on earlier architectures.

That is true, but IIRC that was specifically in relation to a commonly
used list data structure that had prefetches all over the place. That
was a pretty bad idea.

I think that explicit prefetching has extremely limited uses, too. The
only cases that I can imagine being helped are cases where there is
extremely predictable sequential access, but some pointer indirection.

Because of the way tuples are fetched across translation unit
boundaries in the cases addressed by the patch, it isn't hard to see
why the compiler does not do this automatically (prefetch instructions
added by the compiler are not common anyway, IIRC). The compiler has
no way of knowing that gettuple_common() is ultimately called from an
important inner loop, which could make all the difference, I suppose.

--
Peter Geoghegan


#9 Andres Freund <andres@anarazel.de>
In reply to: Peter Geoghegan (#8)
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

On 2015-09-02 16:02:00 -0700, Peter Geoghegan wrote:

On Wed, Sep 2, 2015 at 3:13 PM, Andres Freund <andres@anarazel.de> wrote:

That's just a question of how to formulate this though?

pg_rfetch(((char *) state->memtuples ) + 3 * sizeof(SortTuple) + offsetof(SortTuple, tuple))?

For something heavily platform dependent like this that seems ok.

Well, still needs to work for tuplestore, which does not have a SortTuple.

Isn't it even more trivial there? It's just an array of void*'s? So
prefetch(state->memtuples + 3 + readptr->current)?

Because of the way tuples are fetched across translation unit
boundaries in the cases addressed by the patch, it isn't hard to see
why the compiler does not do this automatically (prefetch instructions
added by the compiler are not common anyway, IIRC).

Hardware prefetchers have just gotten rather good, and have obliterated
most of the cases where explicit prefetching is beneficial.

I'd be interested to see a perf stat -ddd comparison to the patch
with/without prefetches. It'll be interesting to see how the number of
cache hits/misses and prefetches changes.

Which microarchitecture did you test this on?


In reply to: Andres Freund (#9)
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

On Wed, Sep 2, 2015 at 4:12 PM, Andres Freund <andres@anarazel.de> wrote:

Well, still needs to work for tuplestore, which does not have a SortTuple.

Isn't it even more trivial there? It's just an array of void*'s? So
prefetch(state->memtuples + 3 + readptr->current)?

All I meant is that there couldn't be one centralized definition for
both. I don't mind if you want to bake it into a macro for
tuplestore.c and tuplesort.c.

Because of the way tuples are fetched across translation unit
boundaries in the cases addressed by the patch, it isn't hard to see
why the compiler does not do this automatically (prefetch instructions
added by the compiler are not common anyway, IIRC).

Hardware prefetchers have just gotten rather good, and have obliterated
most of the cases where explicit prefetching is beneficial.

Well, if hardware prefetchers are bright enough to do this perfectly
nowadays, then that implies a cost for useless prefetch instructions.
It might still be worth it even then, if the cost is very low for
these platforms. It's not as if they're required to respect the
prefetch hint in any way.

I'd be interested to see a perf stat -ddd comparison to the patch
with/without prefetches. It'll be interesting to see how the number of
cache hits/misses and prefetches changes.

Which microarchitecture did you test this on?

My laptop has an Intel Core i7-3520M, which is a mobile processor that
is a bit old. So, Ivy Bridge.

--
Peter Geoghegan


#11 Andres Freund <andres@anarazel.de>
In reply to: Peter Geoghegan (#10)
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

On 2015-09-02 16:23:13 -0700, Peter Geoghegan wrote:

On Wed, Sep 2, 2015 at 4:12 PM, Andres Freund <andres@anarazel.de> wrote:

Well, still needs to work for tuplestore, which does not have a SortTuple.

Isn't it even more trivial there? It's just an array of void*'s? So
prefetch(state->memtuples + 3 + readptr->current)?

All I meant is that there couldn't be one centralized definition for
both. I don't mind if you want to bake it into a macro for
tupelstore.c and tuplesort.c.

I'm not following? Just write pg_read_prefetch(state->memtuples + 3 +
readptr->current) and the corresponding version for tuplesort in the
callsites?


In reply to: Andres Freund (#11)
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

On Wed, Sep 2, 2015 at 4:26 PM, Andres Freund <andres@anarazel.de> wrote:

I'm not following? Just write pg_read_prefetch(state->memtuples + 3 +
readptr->current) and the corresponding version for tuplesort in the
callsites?

Oh, I see. Maybe I'll do it that way when I pick this up in a little
while. I need to consider the "no tuple proper" (pass-by-value datum
sort) case within tuplesort.c, where stup.tuple is always NULL.
Currently, we discriminate against such SortTuples for various other
reasons (e.g. for memory accounting) by testing if the pointer is set,
so I suppose I copied that pattern here.
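That guard pattern (hypothetical helper name and simplified struct; the NULL test mirrors how pass-by-value datum sorts are discriminated elsewhere in tuplesort.c) might be sketched as:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct SortTuple
{
	void	   *tuple;			/* NULL for pass-by-value datum sorts */
} SortTuple;

#if defined(__GNUC__) || defined(__clang__)
#define pg_rfetch(addr)		__builtin_prefetch((addr), 0, 0)
#else
#define pg_rfetch(addr)		((void) 0)
#endif

/*
 * Prefetch the tuple proper three entries ahead of 'current', but only when
 * it exists: pass-by-value datum sorts always have stup.tuple == NULL, and
 * skipping the useless prefetch there avoided a regression in testing.
 * Returns whether a prefetch hint was issued, for illustration only.
 */
static bool
prefetch_ahead(SortTuple *memtuples, int current, int memtupcount)
{
	if (current + 3 < memtupcount &&
		memtuples[current + 3].tuple != NULL)
	{
		pg_rfetch(memtuples[current + 3].tuple);
		return true;
	}
	return false;
}
```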

--
Peter Geoghegan


In reply to: Peter Geoghegan (#6)
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

On Wed, Sep 2, 2015 at 12:24 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund <andres@anarazel.de> wrote:

On 2015-07-19 16:34:52 -0700, Peter Geoghegan wrote:
Hm. Is a compiler test actually test anything reliably here? Won't this
just throw a warning during compile time about an unknown function?

I'll need to look into that.

My error here was not considering why the existing __builtin_bswap32()
test worked alright with AC_COMPILE_IFELSE(), while my test needed
AC_LINK_IFELSE(). I'll be sure to make a point of manually testing
failure (not just success) when writing compiler tests in the future.

AC_LINK_IFELSE() appears to work fine following a straight
substitution. When I get around to revising this patch in a while,
that fix will be included.
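Assuming a straight substitution, the revised configure macro would presumably read (a sketch of the fix, not the posted revision):

```m4
# PGAC_C_BUILTIN_PREFETCH
# -------------------------
# Check if the C compiler understands __builtin_prefetch(),
# and define HAVE__BUILTIN_PREFETCH if so.  AC_LINK_IFELSE is needed
# rather than AC_COMPILE_IFELSE: compiling an unknown builtin can
# succeed with only a warning, while linking fails.
AC_DEFUN([PGAC_C_BUILTIN_PREFETCH],
[AC_CACHE_CHECK(for __builtin_prefetch, pgac_cv__builtin_prefetch,
[AC_LINK_IFELSE([AC_LANG_PROGRAM([],
[int i = 0;__builtin_prefetch(&i, 0, 3);])],
[pgac_cv__builtin_prefetch=yes],
[pgac_cv__builtin_prefetch=no])])
if test x"$pgac_cv__builtin_prefetch" = xyes ; then
AC_DEFINE(HAVE__BUILTIN_PREFETCH, 1,
          [Define to 1 if your compiler understands __builtin_prefetch.])
fi])# PGAC_C_BUILTIN_PREFETCH
```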

Thanks
--
Peter Geoghegan


#14 Andres Freund <andres@anarazel.de>
In reply to: Peter Geoghegan (#13)
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

On 2015-09-02 17:14:12 -0700, Peter Geoghegan wrote:

On Wed, Sep 2, 2015 at 12:24 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund <andres@anarazel.de> wrote:

On 2015-07-19 16:34:52 -0700, Peter Geoghegan wrote:
Hm. Is a compiler test actually test anything reliably here? Won't this
just throw a warning during compile time about an unknown function?

I'll need to look into that.

My error here was not considering why the existing __builtin_bswap32()
test worked alright with AC_COMPILE_IFELSE(), while my test needed
AC_LINK_IFELSE(). I'll be sure to make a point of manually testing
failure (not just success) when writing compiler tests in the future.

AC_LINK_IFELSE() appears to work fine following a straight
substitution. When I get around to revising this patch in a while,
that fix will be included.

Be sure to also test with optimizations enabled - often tests can get
optimized away if you don't force the result to be used (or volatile) :/


#15 David Rowley <dgrowleyml@gmail.com>
In reply to: Peter Geoghegan (#6)
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

On 3 September 2015 at 07:24, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund <andres@anarazel.de> wrote:

What worries me about adding explicit prefetching is that it's *highly*
platform and even micro-architecture dependent. Why is looking three
elements ahead the right value?

Because that was the fastest value following testing on my laptop. You
are absolutely right to point out that this isn't a good reason to go
with the patch -- I share your concern. All I can say in defense of
that is that other major system software does the same, without any
regard for the underlying microarchitecture AFAICT. So, yes,
certainly, more testing is required across a reasonable cross-section
of platforms to justify the patch.

FWIW someone else found 3 to be good on the platform they tested on:
http://www.naftaliharris.com/blog/2x-speedup-with-one-line-of-code/

Peter, would you be able to share the test case which you saw the speedup
on. So far I've been unable to see much of an improvement.

Regards

David Rowley

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In reply to: David Rowley (#15)
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

On Wed, Sep 2, 2015 at 9:43 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:

Peter, would you be able to share the test case which you saw the speedup
on. So far I've been unable to see much of an improvement.

The case I tested was an internal sort CREATE INDEX. I don't recall
the exact details, but testing showed it to be a fairly robust
speedup. It was not a very brief CREATE INDEX operation, or a very
lengthy one. trace_sort output made it quite visible that there was
a significant saving after the sort is "performed", but before it is
"done". It wasn't hard to see an improvement on a variety of other
cases, although the Intel vTune tool made the difference particularly
obvious.

The only thing that definitely won't be helped is pass-by-value datum
sort cases. In case it matters, I used GCC 4.8.

--
Peter Geoghegan


#17 David Rowley <dgrowleyml@gmail.com>
In reply to: Peter Geoghegan (#16)
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

On 3 September 2015 at 16:50, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Sep 2, 2015 at 9:43 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:

Peter, would you be able to share the test case which you saw the speedup
on. So far I've been unable to see much of an improvement.

The case I tested was an internal sort CREATE INDEX. I don't recall
the exact details, but testing showed it to be a fairly robust
speedup. It was not a very brief CREATE INDEX operation, or a very
lengthy one. trace_sort output made it quite visible that there was
a significant saving after the sort is "performed", but before it is
"done". It wasn't hard to see an improvement on a variety of other
cases, although the Intel vTune tool made the difference particularly
obvious.

The only thing that definitely won't be helped is pass-by-value datum
sort cases. In case it matters, I used GCC 4.8.

My test cases are:

set work_mem ='1GB';
create table t1 as select md5(random()::text) from
generate_series(1,10000000);

Times are in milliseconds. Median and average over 10 runs.

Test 1
select count(distinct md5) from t1;

Master Patched
Median 10,965.77 10,986.30 (99.81%)
Average 10,983.63 11,013.55 (99.73%)

Test 2
select sum(rn) from (select row_number() over (order by md5) rn from t1) a;

Master Patched
Median 12,499.03 12,465.21 (100.27%)
Average 12,504.87 12,468.91 (100.29%)

Test 3
create index t1_md5_idx on t1(md5);
Master Patched
Median 12,981.47 12,888.11 (100.72%)
Average 13,416.23 13,249.32 (101.26%)

gcc version 4.8.3
Intel(R) Xeon(R) CPU E5-2630 v3

Are you seeing any speedup from any of these on your hardware?

Regards

David Rowley

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#18 Peter Geoghegan
pg@heroku.com
In reply to: David Rowley (#17)
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

On Thu, Sep 3, 2015 at 5:35 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:

My test cases are:

Note that my text caching and unsigned integer comparison patches have
moved the baseline down quite noticeably. I think that my mobile
processor out-performs the Xeon you used for this, which seems a
little odd even taking the change in baseline performance into account.

set work_mem ='1GB';
create table t1 as select md5(random()::text) from
generate_series(1,10000000);

Times are in milliseconds. Median and average over 10 runs.

Test 1
select count(distinct md5) from t1;

Master Patched
Median 10,965.77 10,986.30 (99.81%)
Average 10,983.63 11,013.55 (99.73%)

Are you seeing any speedup from any of these on your hardware?

I gather that 10,965.77 here means 10,965 milliseconds, since that
roughly matches what I get.

For the sake of simplicity, I will focus on your test 1 as a baseline.
Note that I ran VACUUM FREEZE before any testing was performed, just
in case.

On my laptop:

postgres=# \dt+ t1
                   List of relations
 Schema | Name | Type  | Owner |  Size  | Description
--------+------+-------+-------+--------+-------------
 public | t1   | table | pg    | 651 MB |
(1 row)

Times for "Test 1", "select count(distinct md5) from t1":

Patch:

Time: 10076.870 ms
Time: 10094.873 ms
Time: 10125.253 ms <-- median
Time: 10222.042 ms
Time: 10269.247 ms

Master:

Time: 10641.142 ms
Time: 10706.181 ms
Time: 10708.860 ms <-- median
Time: 10725.426 ms
Time: 10781.398 ms

So, to answer your question: Yes, I can see a benefit for this query
on my test hardware (laptop), although it is not spectacular. It may
still be quite worthwhile.

I attach a revised version of the patch tested here, following
feedback from Andres. This should not make any difference to the
performance.

It's worth considering that for some (possibly legitimate) reason, the
built-in function call is ignored by your compiler, since GCC has
license to do that. You might try this on both master and patched
builds:

~/postgresql/src/backend/utils/sort$ gdb -batch -ex 'file tuplesort.o'
-ex 'disassemble tuplesort_gettuple_common' > prefetch_disassembly.txt

Then diff the file prefetch_disassembly.txt for each build to see what
the differences are in practice. Consider an extract of the output on
my system:

...
0x00000000000028ee <+926>: callq 0x28f3 <tuplesort_gettuple_common+931>
0x00000000000028f3 <+931>: nopl 0x0(%rax,%rax,1)
0x00000000000028f8 <+936>: sub $0x1,%eax
0x00000000000028fb <+939>: test %eax,%eax
0x00000000000028fd <+941>: mov %eax,0xd0(%rdi)
0x0000000000002903 <+947>: jne 0x25ce <tuplesort_gettuple_common+126>
0x0000000000002909 <+953>: jmpq 0x2710 <tuplesort_gettuple_common+448>
0x000000000000290e <+958>: xchg %ax,%ax
0x0000000000002910 <+960>: mov 0x58(%rdi),%rsi
0x0000000000002914 <+964>: lea (%rax,%rax,2),%rax
0x0000000000002918 <+968>: lea (%rsi,%rax,8),%rax
0x000000000000291c <+972>: mov 0x30(%rax),%rax
0x0000000000002920 <+976>: prefetchnta (%rax)
0x0000000000002923 <+979>: mov $0x1,%eax
0x0000000000002928 <+984>: jmpq 0x2712 <tuplesort_gettuple_common+450>
0x000000000000292d <+989>: nopl (%rax)
...

Notably, there is a prefetchnta instruction here.

Note that I'm going away on vacation in about a week. I wanted to give
people feedback on various things before then, since it was overdue.
FYI, after Thursday I will be very unlikely to answer e-mail for a
couple of weeks.

--
Peter Geoghegan

Attachments:

0001-Prefetch-from-memtuples-array-in-tuplesort.patch (text/x-patch; +113 -1)
#19 Amir Rohan
amir.rohan@zoho.com
In reply to: Peter Geoghegan (#18)
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

On 10/11/2015 03:20 AM, Peter Geoghegan wrote:

On Thu, Sep 3, 2015 at 5:35 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:

My test cases are:

Note that my text caching and unsigned integer comparison patches have
moved the baseline down quite noticeably. I think that my mobile
processor out-performs the Xeon you used for this, which seems a
little odd even taking the change in baseline performance into account.

To add a caveat not yet mentioned, the idea behind prefetching is to
sacrifice spare memory bandwidth for performance. That can be
a winning bet on a quiet box (the one we benchmark on), but can backfire
on a production db, where the extra memory pressure can degrade all running
queries. Something to test for, or at least keep in mind.

set work_mem ='1GB';
create table t1 as select md5(random()::text) from
generate_series(1,10000000);

Times are in milliseconds. Median and average over 10 runs.

Test 1

I am the reluctant owner of outmoded hardware. Namely a core2 from
around 2007 on plain spinning metal. My results (linux 64bit):

------
Test 1
------
set work_mem ='1GB';
select count(distinct md5) from t1;

== Master ==
42771.040 ms <- outlier?
41704.570 ms
41631.660 ms
41421.877 ms

== Patch ==
42469.911 ms <- outlier?
41378.556 ms
41375.870 ms
41118.105 ms
41096.283 ms
41095.705 ms

------
Test 2
------
select sum(rn) from (select row_number() over (order by md5) rn from t1) a;

== Master ==
44186.775 ms
44137.154 ms
44111.271 ms
44061.896 ms
44109.122 ms

== Patch ==
44592.590 ms
44403.076 ms
44276.170 ms

Very slight difference in an ambiguous direction, but also no perf
catastrophe.

It's worth considering that for some (possibly legitimate) reason, the
built-in function call is ignored by your compiler, since GCC has
license to do that. You might try this on both master and patched
builds:

~/postgresql/src/backend/utils/sort$ gdb -batch -ex 'file tuplesort.o'
-ex 'disassemble tuplesort_gettuple_common' > prefetch_disassembly.txt

...

Notably, there is a prefetchnta instruction here.

I have verified the prefetch is emitted in the disassembly.

An added benefit of owning outmoded hardware is that the MSR for this
generation is public and I can disable individual prefetcher units by
twiddling a bit.

Disabling the "HW prefetch" or the "DCU prefetch" units on a pacthed
version gave results that look relatively unchanged, which seems promising.
Disabling them both at once on an unpatched version shows a slowdown of
5-6% in test1 (43347.181, 43898.705, 43399.428). That gives an
indication of maximum potential gains in this direction, for this box
at least.

Finally, I notice my results are 4x slower than everyone else's.
That can be very tough on a man's pride, let me tell you.

Amir


#20 David Rowley
dgrowleyml@gmail.com
In reply to: Peter Geoghegan (#18)
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

On 11 October 2015 at 13:20, Peter Geoghegan <pg@heroku.com> wrote:

It's worth considering that for some (possibly legitimate) reason, the
built-in function call is ignored by your compiler, since GCC has
license to do that. You might try this on both master and patched
builds:

You're right, gcc did not include the prefetch instructions.
I've tested again on the same machine, but with clang 3.7 instead of gcc
4.8.3.

I've conducted the same tests again. All times are in milliseconds. Results
are the average and median over 10 runs.

set work_mem ='1GB';
create table t1 as select md5(random()::text) from
generate_series(1,10000000);
vacuum freeze t1;

Running 1 query at a time the results are as follows:

Test1: select count(distinct md5) from t1;

Master Patched Gain
Average 10853.679 10132.544 107.12%
Median 10754.193 10005.001 107.49%

Test2: select sum(rn) from (select row_number() over (order by md5) rn from
t1) a;
Master Patched Gain
Average 11495.8703 11475.0081 100.18%
Median 11495.6015 11455.944 100.35%

Test3: create index t1_md5_idx on t1(md5);
Master Patched Gain
Average 36464.4632 37830.3879 96.39%
Median 35946.608 36765.0055 97.77%

I also decided to run multiple queries at once, to see if there was any
cache pollution problems with the prefetching.

Test 1 pgbench -T 600 -c 16 -j 16 -f test1.sql -n
Test 2 pgbench -T 600 -c 16 -j 16 -f test2.sql -n

(tps output from pgbench was converted to milliseconds with 1/TPS*1000)
Master Patched Gain
Test 1 1375.413 1358.494 101.25%
Test 2 1594.753 1588.340 100.40%

CPU: 1 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz

I've attached a spreadsheet with all of the results.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

results.ods (application/vnd.oasis.opendocument.spreadsheet)