CUDA Sorting
Hello everyone,
I'm implementing a CUDA-based sort in PostgreSQL, and I believe it
can improve ORDER BY performance by 4 to 10 times. I
already have a generic CUDA sort that performs around 10 times faster
than the standard qsort. I also managed to load CUDA into pgsql.
Since I'm new to pgsql development, I replaced the code of pgsql's
qsort_arg to get used to the way Postgres does the sort. The problem
is that I can't use the qsort_arg_comparator comparison function on
the GPU; I need to implement my own. I couldn't find out how to access
the sort key values of the tuples in the Tuplesortstate or
SortTuple structures. This part looks complicated because it seems the
state holds the pointer to the scanner(?), but I haven't managed to
access the values directly. Can anyone tell me how this works?
Cheers,
Vítor
On 19 September 2011 13:11, Vitor Reus <vitor.reus@gmail.com> wrote:
I can't help with explaining the inner workings of the sorting code, but
just a note that CUDA is a proprietary framework from nVidia and
confines its use to nVidia GPUs only. You'd probably be better off
investing in the OpenCL standard, which is processor-agnostic. Work
has already been done in this area by Tim Child with pgOpenCL,
although it doesn't appear to be available yet. It might be worth
engaging with him to see if there are commonalities in what you're
both trying to achieve.
--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935
EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 19 September 2011 14:32, Vitor Reus <vitor.reus@gmail.com> wrote:
Hi Thom Brown,
thank you very much for your reply.
I am aware that CUDA is a proprietary framework, but since the high-level
CUDA API is easier than OpenCL, it will be faster to implement
and test. Also, CUDA can be translated to OpenCL in a straightforward
way, since the code generated by the low-level CUDA API is really
similar to OpenCL.
I'll try engaging with Tim Child, but it seems that his work is to
create GPU support for specific SQL, like procedural SQL statements
with CUDA extensions; did I understand that right? My focus is to
"unlock" the GPU's power without the user being aware of it.
Please use Reply To All in your responses so the mailing list is included.
Is your aim to have this committed into core PostgreSQL, or just for
your own version? If it's the former, I don't anticipate any
enthusiasm from the hacker community.
But you're right, Tim Child's work is aimed at procedural acceleration
rather than speeding up core functionality (from what I gather
anyway).
--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935
EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Sep 19, 2011 at 1:11 PM, Vitor Reus <vitor.reus@gmail.com> wrote:
Since I'm new to pgsql development, I replaced the code of pgsql's
qsort_arg to get used to the way Postgres does the sort. The problem
is that I can't use the qsort_arg_comparator comparison function on
the GPU; I need to implement my own. I couldn't find out how to access
the sort key values of the tuples in the Tuplesortstate or
SortTuple structures. This part looks complicated because it seems the
state holds the pointer to the scanner(?), but I haven't managed to
access the values directly. Can anyone tell me how this works?
This is something I've been curious about for a while. The biggest
difficulty is that Postgres has a user-extensible type system and
calls user-provided functions to do things like comparisons. Postgres
only supports comparison sorts and does so by calling the comparison
function for the data type being sorted.
These user-defined functions are looked up earlier, in the query parsing
and analysis phase, and stored in Tuplesortstate->scanKeys, which is an
array of structures that hold information about the ordering required.
In there, there's a pointer to the comparison function, a set of flags
(such as NULLS FIRST/LAST), and the collation needed.
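(To make that concrete, here is a minimal sketch of how such a comparison
might be invoked through that structure. The field names -- scanKeys,
sk_func, sk_collation, datum1 -- follow the 9.1-era tuplesort.c and skey.h
layout, and the snippet is meant as an illustration living inside
tuplesort.c, not as a drop-in patch.)

/* Minimal sketch: compare the leading sort key of two SortTuples by
 * calling the user-provided comparison function stored in the first
 * ScanKey.  NULLS FIRST/LAST handling via sk_flags and descending
 * order are omitted for brevity. */
static int
compare_leading_key(Tuplesortstate *state,
                    const SortTuple *a, const SortTuple *b)
{
    ScanKey key = &state->scanKeys[0];
    Datum   result;

    result = FunctionCall2Coll(&key->sk_func, key->sk_collation,
                               a->datum1, b->datum1);
    return DatumGetInt32(result);
}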
I assume you're going to have to have tuplesort.c recognize whether all the
comparators are one of a small set of standard comparators that you
can implement on the GPU, such as integer and floating-point
comparison. In that case you could call a specialized qsort which
has that comparator inlined instead of calling the standard
function. That might actually be a useful optimization to do anyway,
since it may well be much faster even without the GPU. So that would
probably be a good place to start.
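(As a rough illustration of that idea, a specialized integer comparator
might look like the sketch below; a hypothetical qsort_tuple_int4() built
around it is the kind of thing that could later be swapped for a GPU kernel.
NULL ordering, DESC, and multi-key sorts are ignored here for brevity.)

/* Sketch: an int4 comparator that a type-specific qsort variant (or a
 * GPU kernel) could inline, instead of going through the fmgr call
 * machinery for every comparison. */
static inline int
compare_int4_datums(Datum a, Datum b)
{
    int32 va = DatumGetInt32(a);
    int32 vb = DatumGetInt32(b);

    if (va < vb)
        return -1;
    if (va > vb)
        return 1;
    return 0;
}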
But the barrier to get over here might be relatively high. In order to
tolerate that amount of duplicated code and special cases there would
have to be benchmarks showing it's significantly faster and helps
real-world user queries. It would also have to be pretty cleanly
implemented so that it doesn't impose a lot of extra overhead every
time this code needs to be changed -- for example when adding
collations it would have been unfortunate to have to add it to half a
dozen specializations of tuplesort (though frankly I don't think that
would have made that much of a dent in the happiness of the people who
worked on collations).
All that said, my personal opinion is that this can be done cleanly and
would be more than worth it even without the GPU -- sorting
integers and floating-point numbers is a very common case and Peter
Geoghegan recently showed our qsort could be about twice as fast if it
could inline the comparisons. With the GPU I'm curious to see how well
it handles multiple processes contending for resources, it might be a
flashy feature that gets lots of attention but might not really be
very useful in practice. But it would be very interesting to see.
--
greg
On 09/19/2011 10:12 AM, Greg Stark wrote:
With the GPU I'm curious to see how well
it handles multiple processes contending for resources, it might be a
flashy feature that gets lots of attention but might not really be
very useful in practice. But it would be very interesting to see.
The main problem here is that the sort of hardware commonly used for
production database servers doesn't have a GPU serious enough to
support CUDA/OpenCL. The very clear trend now is that all
systems other than gaming ones ship with motherboard graphics chipsets
more than powerful enough for any task but that. I just checked the 5
most popular server configurations I see my customers deploy
PostgreSQL onto (a mix of Dell and HP units), and you don't get a
serious GPU from any of them.
Intel's next generation Ivy Bridge chipset, expected for the spring of
2012, is going to add support for OpenCL to the built-in motherboard
GPU. We may eventually see that trickle into the server hardware side
of things too.
I've never seen a PostgreSQL server capable of running CUDA, and I don't
expect that to change.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
On 19 September 2011 15:36, Greg Smith <greg@2ndquadrant.com> wrote:
But couldn't that also be seen as a chicken/egg situation? No one
buys GPUs for database servers because the database won't make use of
them, and databases don't implement GPU functionality since database
servers don't tend to have GPUs. It's more likely the latter of those
two would have to be the first to budge.
But nVidia does produce a non-graphics-oriented GPGPU line called
Tesla dedicated to such processing.
--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935
EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Sep 19, 2011 at 3:36 PM, Greg Smith <greg@2ndquadrant.com> wrote:
The main problem here is that the sort of hardware commonly used for
production database servers doesn't have a GPU serious enough to support
CUDA/OpenCL.
Of course that could change if adding a GPU would help Postgres... I
would expect it to help mostly for data warehouse batch query type
systems, especially ones with very large i/o subsystems that can
saturate the memory bus with sequential i/o. "Run your large batch
queries twice as fast by adding a $400 part to your $40,000 server"
might be a pretty compelling sales pitch :)
That said, to help in the case I described you would have to implement
the tapesort algorithm on the GPU as well. I expect someone has
implemented heaps for CUDA/OpenCL already though.
--
greg
On 19 September 2011 15:54, Greg Stark <stark@mit.edu> wrote:
I seem to recall a paper on such a thing by Carnegie Mellon
University. Can't remember where I saw it though.
--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935
EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 19 September 2011 16:10, Thom Brown <thom@linux.com> wrote:
Found it! http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/ngm/15-823/project/Final.pdf
--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935
EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Greg Stark <stark@mit.edu> writes:
That said, to help in the case I described you would have to implement
the tapesort algorithm on the GPU as well.
I think the real problem would be that we are seldom sorting just the
key values. If you have to push the tuples through the GPU too, your
savings are going to go up in smoke pretty quickly ...
FWIW, I tend to believe a variant of what Greg Stark said upthread:
there would surely be some win from reducing the impedance mismatch for
comparison functions. In concrete terms, there would be no reason to
have tuplesort.c's myFunctionCall2Coll, and maybe not
inlineApplySortFunction either, if the datatype-specific comparison
functions had APIs that were closer to what sorting wants rather than
following the general SQL-callable-function API. And those functions
cost a *lot* more than a one-instruction comparison does. But it's very
much more of a stretch to believe that inlining per se is going to do
much for us, and even more of a stretch to believe that getting a
separate processor involved is going to be a win.
regards, tom lane
2011/9/19 Thom Brown <thom@linux.com>
Is your aim to have this committed into core PostgreSQL, or just for
your own version? If it's the former, I don't anticipate any
enthusiasm from the hacker community.
This is a research thesis and I'm not confident about committing it to
core just by myself. I will, however, release the source, and I
believe it will open the way for future work to be committed to core
PostgreSQL.
2011/9/19 Greg Stark <stark@mit.edu>
Of course that could change if adding a GPU would help Postgres... I
would expect it to help mostly for data warehouse batch query type
systems, especially ones with very large i/o subsystems that can
saturate the memory bus with sequential i/o. "Run your large batch
queries twice as fast by adding a $400 part to your $40,000 server"
might be a pretty compelling sales pitch :)
My focus is also energy proportionality. If you add a GPU, you
increase the power consumption by about 2 times, but you could perhaps
increase the efficiency much more.
That said, to help in the case I described you would have to implement
the tapesort algorithm on the GPU as well. I expect someone has
implemented heaps for CUDA/OpenCL already though.
For now, I'm planning to implement just the in-memory sort, for
simplicity and to see if it would give a real performance gain.
2011/9/19 Greg Stark <stark@mit.edu>:
In which case you could call a specialized qsort which
implements that comparator inlined instead of calling the standard
function.
Actually I'm now trying to make a custom comparator for integers, but
I haven't made much progress. If this works, I'll port it to the GPU and
start working on the next comparators, such as float, then strings,
in an incremental way.
2011/9/19 Thom Brown <thom@linux.com>:
Found it! http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/ngm/15-823/project/Final.pdf
This is really great work, and I'm basing mine on it. But it's
implemented using OpenGL (yes, not OpenCL) and therefore has a lot of
limitations. I also tried to contact naju but didn't get an answer.
Vítor Uwe Reus
On Mon, Sep 19, 2011 at 7:11 AM, Vitor Reus <vitor.reus@gmail.com> wrote:
Hello everyone,
I'm implementing a CUDA-based sort in PostgreSQL, and I believe it
can improve ORDER BY performance by 4 to 10 times. I
already have a generic CUDA sort that performs around 10 times faster
than the standard qsort. I also managed to load CUDA into pgsql.
NVIDIA cards are not as good as ATI cards. ATI cards are much faster
with integer operations, and should be ideal for sorting transaction
ids or similar numbers (unless you are going to sort prices
stored as floats, where ATI still beats NVIDIA, but not by that much).
Another problem you have to deal with is PCI Express speed. Transfer
is very slow compared to RAM. You will have to add more GPUs to match
the performance, and this will increase the cost of the solution. There
was a sorting algorithm for 4 CPU cores that was beating sort on a GTX 285
(I don't have the link, sorry), so CPUs are not as bad at sorting
as you might think.
AMD is already working on embedding GPUs into the motherboard; if I
am not mistaken, some of them are already on the market and available
for purchase.
Anyone who uses a tiny embedded ATI GPU for integer sorting problems
will outperform your NVIDIA-based, PCI-Express-connected GPU with CUDA,
because your algorithm will basically waste a lot of time transferring
data to the GPU and getting it back.
But if you use an embedded ATI GPU, you can also use the SSE registers on
each CPU core to add more performance to your algorithm. It is not
going to be a very hardware-compatible solution, but if you want good
speed/cost, this should be the best option.
I recommend doing some bandwidth benchmark test before you start coding.
Regards
Nulik
--
==================================
The power of zero is infinite
On Mon, Sep 19, 2011 at 10:36 AM, Greg Smith <greg@2ndquadrant.com> wrote:
Intel's next generation Ivy Bridge chipset, expected for the spring of 2012,
is going to add support for OpenCL to the built-in motherboard GPU. We may
eventually see that trickle into the server hardware side of things too.
Note that Amazon's EC2 offerings include a configuration with a pair of GPUs.
Whether or not this continues has a certain "chicken and egg" aspect to it...
- I'm glad that Amazon is selling such a configuration, as it does
give folks the option of trying it out.
- Presumably, it will only continue on their product list if customers
do more than merely "trying it out."
I think I'd be shocked if PostgreSQL offered much support for such a
configuration in the next year; despite there being some work ongoing,
drawing the functionality into core would require Core decisions that
I'd be surprised to see so quickly.
Unfortunately, that may be slow enough progress that PostgreSQL won't
be contributing to the would-be success of the technology.
If this kind of GPU usage fails to attract much interest, then it's
probably a good thing that we're not committed to it. But if other
uses lead to it taking off, then we'll doubtless get a lot of noise on
lists about a year from now to the effect "Why don't you have this in
core yet? Not 3773t enough!?!?"
Having a bit of progress taking place now would probably be good
timing, in case it *does* take off...
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
2011/9/19 Nulik Nol <nuliknol@gmail.com>:
I recommend doing some bandwidth benchmark test before you start coding.
I already did some benchmarks with GPU sorting (not in pgsql), and
measured total sort times, copy bandwidth and energy usage, and got
some exciting results:
I got around 1GB/s bandwidth with a GeForce GT 430 on an MS-9803 motherboard.
The power increase ratio was 2.75 times on a Core 2 Duo T8300 when adding
the GT 430: http://tinyurl.com/6h7cgv2
The sorting time performance increases when you have more data, but on
average it is 7.8 times faster than the CPU: http://tinyurl.com/6c95dc2
* Thom Brown (thom@linux.com) wrote:
But nVidia does produce a non-graphics-oriented GPGPU line called
Tesla dedicated to such processing.
Just as a side note, I've got a couple of Teslas that aren't doing
terribly much at the moment, and they're in a Linux 'server'-type box
from Penguin Computing. I could certainly install PG on it and run some
tests, if someone's written the code and provides the tests.
I agree that it'd be interesting to do, but I share Lord Stark's
feelings about the challenges and the lack of potential gain; it's a very
small set of queries that would benefit from this. You need to be
working with enough data to make the cost of transferring it all over to
the GPU worthwhile, just for starters...
Thanks,
Stephen
On 09/19/2011 10:53 AM, Thom Brown wrote:
But couldn't that also be seen as a chicken/egg situation?
The chicken/egg problem here is a bit deeper than just "no one offers
GPUs because no one wants them" on server systems. One of the reasons
there aren't more GPUs in typical database server configurations is that
you're already filling up some number of the full-size slots, and
correspondingly the bandwidth available to cards, with disk
controllers. It doesn't help that many server-class motherboards don't
even have an x16 PCI-e slot on them, which is what most GPUs, as delivered
on regular consumer video cards, are optimized for.
But nVidia does produce a non-graphics-oriented GPGPU line called
Tesla dedicated to such processing.
Tesla units start at around $1500 USD, which is a nice budget to spend
on either more RAM (to allow higher work_mem), faster storage to put
temporary files on, or a faster CPU to chew through all sorts of tasks
more quickly. The Tesla units are easy to justify if you have a serious
GPU-oriented application. The good bang-for-the-buck point for GPU
sorting in PostgreSQL is probably going to be a $50-$100 video card
instead. For example, the card Vitor is seeing good results on costs
around $60. (That's also a system with fairly slow RAM, though; it will
be interesting to see if the gain holds up on newer systems.)
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
On Sep 19, 2011, at 5:16 PM, Tom Lane wrote:
Greg Stark <stark@mit.edu> writes:
That said, to help in the case I described you would have to implement
the tapesort algorithm on the GPU as well.
I think the real problem would be that we are seldom sorting just the
key values. If you have to push the tuples through the GPU too, your
savings are going to go up in smoke pretty quickly …
I would argue along a similar line.
To make GPU code fast, it has to be pretty much tailored to do exactly one thing; otherwise you have no chance of getting anywhere close to card bandwidth.
If you look at two "similar" GPU codes which seem to do the same thing, you might easily see that one is 10 times faster than the other, for reasons such as memory alignment, memory transaction size, or whatever.
This opens up a bit of a problem: PostgreSQL sorting is so generic and so flexible that I would be really surprised if somebody could come up with a solution which really comes close to what the GPU can do.
It would definitely be interesting to see a prototype, however.
By the way, there is a handful of interesting talks / lectures about GPU programming provided by the University of Chicago (I just cannot find the link at the moment).
regards,
hans
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
2011/9/19 Greg Smith <greg@2ndquadrant.com>:
Intel's Sandy Bridge and Ivy Bridge series combine CPU and GPU. I don't
know how using the GPU affects the CPU part, but it might be interesting
to explore...
--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation
On Sep19, 2011, at 19:46 , Stephen Frost wrote:
I agree that it'd be interesting to do, but I share Lord Stark's
feelings about the challenges and lack of potential gain- it's a very
small set of queries that would benefit from this. You need to be
working with enough data to make the cost of tranferring it all over to
the GPU worthwhile, just for starters..
I wonder if anyone has ever tried to employ a GPU for more low-level
tasks. Things like sorting or hashing are hard to move to the
GPU in Postgres because, in the general case, they involve essentially
arbitrary user-defined functions. But couldn't, for example, the WAL CRC
computation be moved to a GPU? Or, to get really crazy, even the search
for the optimal join order (only for a large number of joins though,
i.e. where we currently switch to a genetic algorithm)?
best regards,
Florian Pflug
I already did some benchmarks with GPU sorting (not in pgsql), and
measured total sort times, copy bandwidth and energy usage, and got
some exciting results:
Was that qsort implementation on the CPU cache-friendly and optimized for SSE?
To make a fair comparison you have to take the best CPU implementation
and compare it to the best GPU implementation. Otherwise, you are
comparing a full-throttle GPU against a lazy CPU.
Check this paper on how hash join was optimized 17x when SSE
instructions were used.
www.vldb.org/pvldb/2/vldb09-257.pdf
Regards
--
==================================
The power of zero is infinite
On Mon, 2011-09-19 at 15:12 +0100, Greg Stark wrote:
With the GPU I'm curious to see how well
it handles multiple processes contending for resources, it might be a
flashy feature that gets lots of attention but might not really be
very useful in practice. But it would be very interesting to see.
There are cases where concurrency may not be that important, like some
specialized OLAP loads where you have to sort, for example when finding a
median in large data sets.
--
-------
Hannu Krosing
PostgreSQL Unlimited Scalability and Performance Consultant
2ndQuadrant Nordic
PG Admin Book: http://www.2ndQuadrant.com/books/
On Mon, 2011-09-19 at 10:36 -0400, Greg Smith wrote:
CUDA sorting could be beneficial on general server hardware if it can
run well on multiple CPUs in parallel. GPUs being in essence parallel
processors on fast shared memory, it may be that even on ordinary RAM
and lots of CPUs some CUDA algorithms are a significant win.
And then there is a non-graphics GPU available on EC2:
Cluster GPU Quadruple Extra Large Instance
22 GB of memory
33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core “Nehalem”
architecture)
2 x NVIDIA Tesla “Fermi” M2050 GPUs
1690 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
API name: cg1.4xlarge
It costs $2.10 per hour, probably a lot less if you use the Spot
Instances.
Hey hackers,
I'm still having problems reading the values of the columns in tuplesort.c,
in order to understand how to port this to CUDA.
Should I use the heap_getattr macro to read them?
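(A minimal sketch of what that could look like, modelled on how
comparetup_heap pulls attributes out of its SortTuples in 9.1-era
tuplesort.c; the details are illustrative rather than authoritative.)

/* Sketch: extract attribute "attno" (1-based) from a SortTuple holding a
 * MinimalTuple, the way comparetup_heap does it: rebuild a HeapTupleData
 * header around the minimal tuple, then use heap_getattr.  tupDesc would
 * come from state->tupDesc. */
static Datum
sorttuple_getattr(SortTuple *stup, int attno, TupleDesc tupDesc, bool *isnull)
{
    HeapTupleData htup;

    htup.t_len = ((MinimalTuple) stup->tuple)->t_len + MINIMAL_TUPLE_OFFSET;
    htup.t_data = (HeapTupleHeader)
        ((char *) stup->tuple - MINIMAL_TUPLE_OFFSET);

    return heap_getattr(&htup, attno, tupDesc, isnull);
}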
On 19/09/2011 16:36, Greg Smith wrote:
The trend is to have servers capable of running CUDA by providing GPUs via
external hardware (a PCI Express interface with PCI Express switches);
look, for example, at the PowerEdge C410x PCIe Expansion Chassis from DELL.
I did some experiments timing the sort done with CUDA and the sort done
with pg_qsort:
                       CUDA       pg_qsort
33 million integers:   ~ 900 ms   ~ 6000 ms
 1 million integers:   ~  21 ms   ~  162 ms
 100k integers:        ~   2 ms   ~   13 ms
The CUDA time already includes the copy operations (host->device, device->host).
As GPU I was using a C2050, and the CPU doing the pg_qsort was an
Intel(R) Xeon(R) CPU X5650 @ 2.67GHz.
Copy operations and kernel runs (the sort for instance) can run in
parallel, so while you are sorting a batch of data, you can copy the
next batch in parallel.
As you can see the boost is not negligible.
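(For what it's worth, a rough sketch of that overlap with the CUDA runtime
API is below; sort_batch_on_gpu() is a hypothetical wrapper around the sort
kernel, the host buffer is assumed to be page-locked so the async copies can
really overlap, and error checking is omitted.)

/* Sketch: overlap host->device copies of the next batch with sorting of
 * the current one, using two CUDA streams. */
#include <cuda_runtime.h>

void sort_batch_on_gpu(int *dev_data, size_t n, cudaStream_t s); /* hypothetical */

void sort_all_batches(const int *host_data, int *dev_buf[2],
                      size_t batch_len, int nbatches)
{
    cudaStream_t stream[2];

    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    for (int i = 0; i < nbatches; i++)
    {
        int cur = i % 2;

        /* queue the copy of batch i; the previous batch may still be
         * sorting on the other stream */
        cudaMemcpyAsync(dev_buf[cur], host_data + (size_t) i * batch_len,
                        batch_len * sizeof(int), cudaMemcpyHostToDevice,
                        stream[cur]);

        /* queue the sort on the same stream, so it starts once its copy
         * has finished */
        sort_batch_on_gpu(dev_buf[cur], batch_len, stream[cur]);
    }

    cudaStreamSynchronize(stream[0]);
    cudaStreamSynchronize(stream[1]);
    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
}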
The next Nvidia hardware (the Kepler family) is PCI Express 3 ready, so expect
the "bottleneck" of the device->host->device copies to have less impact
in the near future.
I strongly believe there is room to provide modern database engines with
a way to offload sorts to the GPU.
I've never seen a PostgreSQL server capable of running CUDA, and I
don't expect that to change.
That sounds like:
"I think there is a world market for maybe five computers."
- IBM Chairman Thomas Watson, 1943
Regards
Gaetano Mendola
On 19/09/2011 21:41, PostgreSQL - Hans-Jürgen Schönig wrote:
The Nvidia Thrust library provides the same sorting flexibility as Postgres
does.
// generate 32M random numbers on the host
thrust::host_vector<int> h_vec(32 << 20);
thrust::generate(h_vec.begin(), h_vec.end(), rand);
// transfer data to the device
thrust::device_vector<int> d_vec = h_vec;
// sort data on the device (846M keys per second on GeForce GTX 480)
thrust::sort(d_vec.begin(), d_vec.end());
// transfer data back to host
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
As you can see, the type to be ordered is a template parameter, and
thrust::sort also has a version that takes the comparator to use.
So, compared with pg_qsort, thrust::sort gives you the same flexibility.
http://docs.thrust.googlecode.com/hg/group__sorting.html
Regards
Gaetano Mendola
I'm wondering if CUDA will win in geometry operations, for example
testing point <@ complex_polygon.
Oleg
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
On 12/02/2012 13:13, Oleg Bartunov wrote:
I'm wondering if CUDA will win in geomentry operations, for example,
tesing point <@ complex_polygon
I'm not sure whether the algorithm you mentioned can be implemented in terms
of vector algebra, BLAS, etc.
There are plenty of geometry operations implemented in CUDA out there; my
field of CUDA application is not this one, so I'm not that much into it.
However, I can point you to the official NVIDIA NPP library, which provides
vector algebra algorithms, and some geometry algorithms as well:
http://developer.download.nvidia.com/compute/DevZone/docs/html/CUDALibraries/doc/NPP_Library.pdf
(take a look around page 620).
Regards
Gaetano Mendola
On 02/11/2012 08:14 PM, Gaetano Mendola wrote:
The trend is to have servers capable of running CUDA by providing GPUs via
external hardware (a PCI Express interface with PCI Express switches);
look, for example, at the PowerEdge C410x PCIe Expansion Chassis from DELL.
The C410X adds 16 PCIe slots to a server, housed inside a separate 3U
enclosure. That's a completely sensible purchase if your goal is to
build a computing cluster, where a lot of work is handed off to a set of
GPUs. I think that's even less likely to be a cost-effective option for
a database server. Adding a single dedicated GPU installed in a server
to accelerate sorting is something that might be justifiable, based on
your benchmarks. This is a much more expensive option than that
though. Details at
http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for anyone who
wants to see just how big this external box is.
I did some experiments timing the sort done with CUDA and the sort
done with pg_qsort:
                       CUDA       pg_qsort
33 million integers:   ~ 900 ms   ~ 6000 ms
 1 million integers:   ~  21 ms   ~  162 ms
 100k integers:        ~   2 ms   ~   13 ms
The CUDA time already includes the copy operations (host->device,
device->host).
As GPU I was using a C2050, and the CPU doing the pg_qsort was an
Intel(R) Xeon(R) CPU X5650 @ 2.67GHz.
That's really interesting, and the X5650 is by no means a slow CPU. So
this benchmark is providing a lot of CPU power yet still seeing over a
6X speedup in sort times. It sounds like the PCI Express bus has gotten
fast enough that the time to hand data over and get it back again can
easily be justified for medium to large sized sorts.
It would be helpful to take this patch and confirm whether it scales
when used in parallel. The easiest way to do that would be to use the
pgbench "-f" feature, which allows running an arbitrary number of copies of
some query at once. Seeing whether this acceleration continues to hold as
the number of clients increases is a useful data point.
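(For example, assuming a hypothetical file sort_test.sql containing a single
large ORDER BY query against the test table, something like

  pgbench -n -f sort_test.sql -c 16 -T 60 testdb

would run 16 concurrent sessions of that query for 60 seconds against a
hypothetical testdb database; -n just skips the vacuum of pgbench's own
tables, which don't exist here.)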
Is it possible for you to break down where the time is being spent? For
example, how much of this time is consumed in the GPU itself, compared
to time spent transferring data between CPU and GPU? I'm also curious
where the bottleneck is with this approach. If it's the speed of the
PCI-E bus for smaller data sets, adding more GPUs may never be
practical. If the bus can handle quite a few of these at once before it
saturates, it might be possible to overload a single GPU. That seems
like it would be really hard to reach for database sorting, though; I
can't really justify my gut feel for that being true.
I've never seen a PostgreSQL server capable of running CUDA, and I
don't expect that to change.
That sounds like:
"I think there is a world market for maybe five computers."
- IBM Chairman Thomas Watson, 1943
Yes, and "640K will be enough for everyone", ha ha. (Having said the
640K thing is flat out denied by Gates, BTW, and no one has come up with
proof otherwise).
I think you've made an interesting case for this sort of acceleration
now being useful for systems doing what's typically considered a data
warehouse task. I regularly see servers waiting for far more than 13M
integers to sort. And I am seeing a clear trend toward providing more
PCI-E slots in servers now. Dell's R810 is the most popular single
server model my customers have deployed in the last year, and it has 5
X8 slots in it. It's rare all 5 of those are filled. As long as a
dedicated GPU works fine when dropped to X8 speeds, I know a fair number
of systems where one of those could be added now.
There's another data point in your favor I didn't notice before your
last e-mail. Amazon has a "Cluster GPU Quadruple Extra Large" node type
that runs with NVIDIA Tesla hardware. That means the installed base of
people who could consider CUDA is higher than I expected. To
demonstrate how much that costs, to provision a GPU enabled reserved
instance from Amazon for one year costs $2410 at "Light Utilization",
giving a system with 22GB of RAM and 1.69TB of storage. (I find the
reserved prices easier to compare with dedicated hardware than the
hourly ones) That's halfway between the High-Memory Double Extra Large
Instance (34GB RAM/850GB disk) at $1100 and the High-Memory Quadruple
Extra Large Instance (64GB RAM/1690GB disk) at $2200. If someone could
prove sorting was a bottleneck on their server, that isn't an
unreasonable option to consider on a cloud-based database deployment.
I still think that an approach based on OpenCL is more likely to be
suitable for PostgreSQL, which was part of why I gave CUDA low odds
here. The points in favor of OpenCL are:
-Since you last posted, OpenCL compiling has switched to using LLVM as
their standard compiler. Good PostgreSQL support for LLVM isn't far
away. It looks to me like the compiler situation for CUDA requires
their PathScale based compiler. I don't know enough about this area to
say which compiling tool chain will end up being easier to deal with.
-Intel is making OpenCL support for its GPUs standard, as I mentioned
before. NVIDIA will be hard pressed to compete with Intel for GPU
acceleration once more systems supporting that enter the market.
-Easy availability of OpenCL on Mac OS X for development's sake. Lots of
Postgres hackers have OS X systems, even though there aren't too many OS
X database servers.
The fact that Amazon provides a way to crack the chicken/egg hardware
problem immediately helps a lot, though; I don't even need a physical
card here to test CUDA GPU acceleration on Linux now. With that data
point, your benchmarks are good enough to say I'd be willing to help
review a patch in this area as part of the 9.3 development cycle.
That may validate that GPU acceleration is useful, and then the next
step would be considering how portable it will be to other GPU
interfaces. I still expect CUDA will be looked back on as a dead end
for GPU-accelerated computing one day. Computing history is not filled
with many single-vendor standards that competed successfully against
Intel providing the same thing. AMD's x86-64 is the only example I can
think of where Intel didn't win that sort of race, and that happened
(IMHO) only because Intel's Itanium failed to prioritize backwards
compatibility highly enough.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
2012/2/13 Greg Smith <greg@2ndquadrant.com>:
As a side note, my module (PG-Strom) also uses CUDA, although I tried to
implement it with OpenCL at the beginning of the project; it didn't work
well when multiple sessions used a GPU device concurrently.
The second background process got an out-of-resources error while
another process had the GPU device open.
I'm not clear whether it is a limitation of OpenCL, of Nvidia's driver, or a
bug in my code. Anyway, I switched to CUDA instead of investigating the
binary drivers. :-(
Thanks,
--
KaiGai Kohei <kaigai@kaigai.gr.jp>
I have no experience with OpenCL, but with CUDA 4.1 you can definitely share
the same device from multiple host threads, for example allocating memory in
one host thread and using it in another thread. Maybe with OpenCL you were
facing the very same limit.
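To make that concrete, here is a minimal sketch of the multi-threaded case
(not the multi-process case KaiGai described): with the CUDA 4.x runtime all
host threads of a process share the device's primary context, so memory
allocated in one thread is usable from another. This is only a sketch; error
handling is omitted and the kernel is just a placeholder.

#include <cuda_runtime.h>
#include <thread>
#include <cstdio>

__global__ void scale(int *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2;
}

int main() {
    const int n = 1 << 20;
    int *d_buf = nullptr;

    /* Thread A: allocate and initialize device memory. */
    std::thread producer([&] {
        cudaMalloc(&d_buf, n * sizeof(int));
        cudaMemset(d_buf, 0, n * sizeof(int));
    });
    producer.join();

    /* Thread B: reuse the same device pointer; with CUDA >= 4.0 both
       host threads share the device's primary context, so this works. */
    std::thread consumer([&] {
        scale<<<(n + 255) / 256, 256>>>(d_buf, n);
        cudaDeviceSynchronize();
    });
    consumer.join();

    cudaFree(d_buf);
    printf("done\n");
    return 0;
}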
I don't think we should be looking at either CUDA or OpenCL directly.
We should be looking for a generic library that can target either and
is well maintained and actively developed. Any GPU code we write
ourselves would rapidly be overtaken by changes in the hardware and
innovations in parallel algorithms. If we find a library that provides
a sorting api and adapt our code to use it then we'll get the benefits
of any new hardware feature as the library adds support for them.
To sort the integers I used NVIDIA's Thrust library.
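For reference, a sort like that is only a few lines with Thrust; a minimal
sketch, assuming a CUDA toolkit with Thrust available. The two explicit copies
correspond to the host->device and device->host legs discussed in this thread.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main() {
    const size_t n = 33554432;                    /* 32M integers */
    thrust::host_vector<int> h(n);
    for (size_t i = 0; i < n; ++i)
        h[i] = rand();

    thrust::device_vector<int> d = h;             /* host -> device copy */
    thrust::sort(d.begin(), d.end());             /* GPU sort (radix for int keys) */
    thrust::copy(d.begin(), d.end(), h.begin());  /* device -> host copy */
    return 0;
}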
On 13/02/2012 08:26, Greg Smith wrote:
On 02/11/2012 08:14 PM, Gaetano Mendola wrote:
The trend is to have server capable of running CUDA providing GPU via
external hardware (PCI Express interface with PCI Express switches),
look for example at PowerEdge C410x PCIe Expansion Chassis from DELL.
The C410X adds 16 PCIe slots to a server, housed inside a separate 3U
enclosure. That's a completely sensible purchase if your goal is to
build a computing cluster, where a lot of work is handed off to a set of
GPUs. I think that's even less likely to be a cost-effective option for
a database server. Adding a single dedicated GPU installed in a server
to accelerate sorting is something that might be justifiable, based on
your benchmarks. This is a much more expensive option than that though.
Details at http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for
anyone who wants to see just how big this external box is.
I did some experiments timing the sort done with CUDA and the sort
done with pg_qsort:
                     CUDA       pg_qsort
33 million integers: ~  900 ms  ~ 6000 ms
 1 million integers: ~   21 ms  ~  162 ms
100k integers:       ~    2 ms  ~   13 ms
The CUDA time already includes the copy operations (host->device,
device->host).
As GPU I was using a C2050, and the CPU doing the pg_qsort was an
Intel(R) Xeon(R) CPU X5650 @ 2.67GHz.
That's really interesting, and the X5650 is by no means a slow CPU. So
this benchmark is providing a lot of CPU power yet still seeing over a
6X speedup in sort times. It sounds like the PCI Express bus has gotten
fast enough that the time to hand data over and get it back again can
easily be justified for medium to large sized sorts.
It would be helpful to take this patch and confirm whether it scales
when used in parallel. The easiest way to do that would be to use the
pgbench "-f" feature, which allows running an arbitrary number of some
query at once. Seeing whether this acceleration continued to hold as the
number of clients increases is a useful data point.
Is it possible for you to break down where the time is being spent? For
example, how much of this time is consumed in the GPU itself, compared
to time spent transferring data between CPU and GPU? I'm also curious
where the bottleneck is at with this approach. If it's the speed of the
PCI-E bus for smaller data sets, adding more GPUs may never be
practical. If the bus can handle quite a few of these at once before it
saturates, it might be possible to overload a single GPU. That seems
like it would be really hard to reach for database sorting, though; I
can't really justify my gut feel for that being true.
There you go (times are in ms):
Size H->D SORT D->H TOTAL
64 0.209824 0.479392 0.013856 0.703072
128 0.098144 0.41744 0.01312 0.528704
256 0.096832 0.420352 0.013696 0.53088
512 0.097568 0.3952 0.014464 0.507232
1024 0.09872 0.396608 0.014624 0.509952
2048 0.101344 0.56224 0.016896 0.68048
4096 0.106176 0.562976 0.02016 0.689312
8192 0.116512 0.571264 0.02672 0.714496
16384 0.136096 0.587584 0.040192 0.763872
32768 0.179296 0.658112 0.066304 0.903712
65536 0.212352 0.84816 0.118016 1.178528
131072 0.317056 1.1465 0.22784 1.691396
262144 0.529376 1.82237 0.42512 2.776866
524288 0.724032 2.39834 0.64576 3.768132
1048576 1.11162 3.51978 1.12176 5.75316
2097152 1.95939 5.93434 2.06992 9.96365
4194304 3.76192 10.6011 4.10614 18.46916
8388608 7.16845 19.9245 7.93741 35.03036
16777216 13.8693 38.7413 15.4073 68.0179
33554432 27.3017 75.6418 30.6646 133.6081
67108864 54.2171 151.192 60.327 265.7361
pg_sort (times in ms)
64 0.010000
128 0.010000
256 0.021000
512 0.128000
1024 0.092000
2048 0.196000
4096 0.415000
8192 0.883000
16384 1.881000
32768 3.960000
65536 8.432000
131072 17.951000
262144 37.140000
524288 78.320000
1048576 163.276000
2097152 339.118000
4194304 693.223000
8388608 1423.142000
16777216 2891.218000
33554432 5910.851000
67108864 11980.930000
As you can notice, the CUDA times are lower than the timings I reported
in my previous post because the server was doing something else in the
meanwhile; I have repeated those benchmarks with the server completely
unused.
And this is the speedup, computed as pg_sort/cuda:
64 0.0142232943
128 0.018914175
256 0.039556962
512 0.2070058671
1024 0.1804091365
2048 0.2880319774
4096 0.6078524674
8192 1.2372357578
16384 2.4637635625
32768 4.4106972133
65536 7.1742037525
131072 10.5090706139
262144 13.3719091955
524288 20.5834084369
1048576 28.2516043357
2097152 33.9618513296
4194304 37.5247168794
8388608 40.5135716561
16777216 42.4743633661
33554432 44.2394809896
67108864 45.1499777411
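For anyone who wants to reproduce that H->D / SORT / D->H breakdown, one way
is to bracket each phase with CUDA events; a rough sketch follows. This is not
Gaetano's harness, and the numbers above did not come from this code.

#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <vector>
#include <cstdio>
#include <cstdlib>

static float elapsed_ms(cudaEvent_t a, cudaEvent_t b) {
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, a, b);
    return ms;
}

int main() {
    const size_t n = 1048576;
    std::vector<int> h(n);
    for (size_t i = 0; i < n; ++i) h[i] = rand();

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    thrust::device_vector<int> d(n);

    cudaEventRecord(t0);
    thrust::copy(h.begin(), h.end(), d.begin());    /* H->D */
    cudaEventRecord(t1);
    thrust::sort(d.begin(), d.end());               /* SORT */
    cudaEventRecord(t2);
    thrust::copy(d.begin(), d.end(), h.begin());    /* D->H */
    cudaEventRecord(t3);
    cudaEventSynchronize(t3);

    printf("H->D %.3f  SORT %.3f  D->H %.3f  TOTAL %.3f (ms)\n",
           elapsed_ms(t0, t1), elapsed_ms(t1, t2),
           elapsed_ms(t2, t3), elapsed_ms(t0, t3));
    return 0;
}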
I've never seen a PostgreSQL server capable of running CUDA, and I
don't expect that to change.
That sounds like:
"I think there is a world market for maybe five computers."
- IBM Chairman Thomas Watson, 1943
Yes, and "640K will be enough for everyone", ha ha. (Gates flatly denies
having said the 640K thing, BTW, and no one has come up with proof
otherwise.)
I think you've made an interesting case for this sort of acceleration
now being useful for systems doing what's typically considered a data
warehouse task. I regularly see servers waiting for far more than 13M
integers to sort. And I am seeing a clear trend toward providing more
PCI-E slots in servers now. Dell's R810 is the most popular single
server model my customers have deployed in the last year, and it has 5
X8 slots in it. It's rare all 5 of those are filled. As long as a
dedicated GPU works fine when dropped to X8 speeds, I know a fair number
of systems where one of those could be added now.
There's another data point in your favor I didn't notice before your
last e-mail. Amazon has a "Cluster GPU Quadruple Extra Large" node type
that runs with NVIDIA Tesla hardware. That means the installed base of
people who could consider CUDA is higher than I expected. To demonstrate
how much that costs, to provision a GPU enabled reserved instance from
Amazon for one year costs $2410 at "Light Utilization", giving a system
with 22GB of RAM and 1.69TB of storage. (I find the reserved prices
easier to compare with dedicated hardware than the hourly ones) That's
halfway between the High-Memory Double Extra Large Instance (34GB
RAM/850GB disk) at $1100 and the High-Memory Quadruple Extra Large
Instance (64GB RAM/1690GB disk) at $2200. If someone could prove sorting
was a bottleneck on their server, that isn't an unreasonable option to
consider on a cloud-based database deployment.
I still think that an approach based on OpenCL is more likely to be
suitable for PostgreSQL, which was part of why I gave CUDA low odds
here. The points in favor of OpenCL are:
-Since you last posted, OpenCL compiling has switched to using LLVM as
their standard compiler. Good PostgreSQL support for LLVM isn't far
away. It looks to me like the compiler situation for CUDA requires their
PathScale based compiler. I don't know enough about this area to say
which compiling tool chain will end up being easier to deal with.
NVIDIA's compiler, nvcc, switched to LLVM as well (CUDA 4.1).
-Intel is making GPU support standard for OpenCL, as I mentioned before.
NVIDIA will be hard pressed to compete with Intel for GPU acceleration
once more systems supporting that enter the market.
-Easy availability of OpenCL on Mac OS X for development's sake. Lots of
Postgres hackers with OS X systems, even though there aren't too many OS
X database servers.
The fact that Amazon provides a way to crack the chicken/egg hardware
problem immediately helps a lot though, I don't even need a physical
card here to test CUDA GPU acceleration on Linux now. With that data
point, your benchmarks are good enough to say I'd be willing to help
review a patch in this area here as part of the 9.3 development cycle.
That may validate that GPU acceleration is useful, and then the next
step would be considering how portable that will be to other GPU
interfaces. I still expect CUDA will be looked back on as a dead end for
GPU accelerated computing one day. Computing history is not filled with
many single-vendor standards that competed successfully against Intel
providing the same thing. AMD's x86-64 is the only example I can think
of where Intel didn't win that sort of race, which happened (IMHO) only
because Intel's Itanium failed to prioritize backwards compatibility
highly enough.
I think that, given that NVIDIA's nvcc now uses LLVM, soon we will be
able to compile "CUDA" programs for any target architecture supported
by LLVM.
Regards
Gaetano Mendola
On Mon, Feb 13, 2012 at 20:48, Greg Stark <stark@mit.edu> wrote:
I don't think we should be looking at either CUDA or OpenCL directly.
We should be looking for a generic library that can target either and
is well maintained and actively developed.
I understand your point about using some external library for the
primitives, but I don't see why it needs to support both CUDA and
OpenCL. Libraries for GPU-accelerated primitives generally target
OpenCL *or* CUDA, not both.
As far as I understand (and someone correct me if I'm wrong), the
difference between them is mostly the API and the fact that CUDA had a
head start, and thus a larger developer community around it. (All the
early adopters went to CUDA)
But OpenCL already acts as an abstraction layer. CUDA is
NVIDIA-specific, but OpenCL is supported by AMD, Intel as well as
NVIDIA. It's pretty rare for servers to have separate graphics cards,
but recent Intel and AMD CPUs already have a GPU included on die,
which is another bonus for OpenCL.
So I'd say, the way things are heading, it's only a matter of time
before OpenCL takes over and there will be little reason to look back.
Regards,
Marti
On 13/02/2012 19:48, Greg Stark wrote:
I don't think we should be looking at either CUDA or OpenCL directly.
We should be looking for a generic library that can target either and
is well maintained and actively developed. Any GPU code we write
ourselves would rapidly be overtaken by changes in the hardware and
innovations in parallel algorithms. If we find a library that provides
a sorting api and adapt our code to use it then we'll get the benefits
of any new hardware feature as the library adds support for them.
I think one option is to make the sort function pluggable with a shared
library/dll. I see several benefits from this:
- It could be in the interest of the hardware vendor to provide the
most powerful sort implementation (I'm sure for example that the TBB sort
implementation is faster than pg_sort)
- It can permit people to "play" with it without being deeply involved
in pg development and such.
- It can relieve the postgres core group of the choice about the right
language/tool/implementation to use.
- It also lets people who are not willing (or not able, for that matter) to
upgrade the postgres engine change just the sort function upon a hardware
upgrade.
Of course, if this happens, the postgres engine has to run some sort of
sanity check (that the function actually sorts, for example) before it can
"thrust" the plugged sort; a sketch of such a check follows below.
The engine could even have multiple sort implementations available and
use the most efficient one (imagine some sorts performing better on
a certain value range or on a certain element size).
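Purely to make the idea concrete, a hypothetical sketch of the shape such a
plug point and sanity check could take. PostgreSQL has no such hook today, and
every name below is invented for illustration; the "plugged" sort here is a
toy insertion sort standing in for a CUDA- or TBB-backed one.

/* Hypothetical sketch only: nothing like this exists in PostgreSQL today. */
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef int (*plug_cmp_fn)(const void *a, const void *b, void *arg);
typedef void (*plug_sort_fn)(void *data, size_t nitems, size_t item_size,
                             plug_cmp_fn cmp, void *arg);

/* Toy "plugged" implementation: insertion sort over opaque elements. */
static void
toy_plug_sort(void *data, size_t nitems, size_t item_size,
              plug_cmp_fn cmp, void *arg)
{
    char *base = (char *) data;
    char  tmp[64];                      /* assumes item_size <= 64 here */

    for (size_t i = 1; i < nitems; i++)
    {
        size_t j = i;
        memcpy(tmp, base + i * item_size, item_size);
        while (j > 0 && cmp(base + (j - 1) * item_size, tmp, arg) > 0)
        {
            memcpy(base + j * item_size, base + (j - 1) * item_size, item_size);
            j--;
        }
        memcpy(base + j * item_size, tmp, item_size);
    }
}

/* Sanity check in the spirit of the proposal: run the candidate on a small
   array and verify the output really is ordered before trusting it. */
static int
plug_sort_is_sane(plug_sort_fn candidate, plug_cmp_fn cmp)
{
    int sample[5] = {4, 1, 3, 5, 2};

    candidate(sample, 5, sizeof(int), cmp, NULL);
    for (int i = 1; i < 5; i++)
        if (cmp(&sample[i - 1], &sample[i], NULL) > 0)
            return 0;
    return 1;
}

static int
int_cmp(const void *a, const void *b, void *arg)
{
    (void) arg;
    return (*(const int *) a > *(const int *) b) -
           (*(const int *) a < *(const int *) b);
}

int main(void)
{
    printf("plugged sort sane: %d\n",
           plug_sort_is_sane(toy_plug_sort, int_cmp));
    return 0;
}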
Regards
Gaetano Mendola
On 15 February 2012 20:00, Gaetano Mendola <mendola@gmail.com> wrote:
On 13/02/2012 19:48, Greg Stark wrote:
I don't think we should be looking at either CUDA or OpenCL directly.
We should be looking for a generic library that can target either and
is well maintained and actively developed. Any GPU code we write
ourselves would rapidly be overtaken by changes in the hardware and
innovations in parallel algorithms. If we find a library that provides
a sorting api and adapt our code to use it then we'll get the benefits
of any new hardware feature as the library adds support for them.
I think one option is to make the sort function pluggable with a shared
library/dll. I see several benefits from this:
- It could be in the interest of the hardware vendor to provide the most
powerful sort implementation (I'm sure for example that the TBB sort
implementation is faster than pg_sort)
- It can permit people to "play" with it without being deeply involved in pg
development and such.
Sorry, but I find it really hard to believe that the non-availability
of pluggable sorting is what's holding people back here. Some vanguard
needs to go and prove the idea by building a rough prototype before we
can even really comment on what an API should look like. For example,
I am given to understand that GPUs generally sort using radix sort -
resolving the impedance mismatch that prevents someone from using a
non-comparison based sort sure sounds like a lot of work for an
entirely speculative reward.
Someone who cannot understand tuplesort, which is not all that
complicated, has no business trying to build GPU sorting into
Postgres.
I had a patch committed a few hours ago that almost included the
capability of assigning an alternative sorting function, but only one
with the exact same signature as my variant of qsort_arg. pg_qsort
isn't used to sort tuples at all, by the way.
Threading building blocks is not going to form the basis of any novel
sorting implementation, because comparators in general are not thread
safe, and it isn't available on all the platforms we support, and
because of how longjmp interacts with C++ stack unwinding and so on
and so on. Now, you could introduce some kind of parallelism into
sorting integers and floats, but that's an awful lot of work for a
marginal reward.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On 15/02/2012 23:11, Peter Geoghegan wrote:
On 15 February 2012 20:00, Gaetano Mendola<mendola@gmail.com> wrote:
On 13/02/2012 19:48, Greg Stark wrote:
I don't think we should be looking at either CUDA or OpenCL directly.
We should be looking for a generic library that can target either and
is well maintained and actively developed. Any GPU code we write
ourselves would rapidly be overtaken by changes in the hardware and
innovations in parallel algorithms. If we find a library that provides
a sorting api and adapt our code to use it then we'll get the benefits
of any new hardware feature as the library adds support for them.
I think one option is to make the sort function pluggable with a shared
library/dll. I see several benefits from this:
- It could be in the interest of the hardware vendor to provide the most
powerful sort implementation (I'm sure for example that the TBB sort
implementation is faster than pg_sort)
- It can permit people to "play" with it without being deeply involved in pg
development and such.
Sorry, but I find it really hard to believe that the non-availability
of pluggable sorting is what's holding people back here. Some vanguard
needs to go and prove the idea by building a rough prototype before we
can even really comment on what an API should look like. For example,
I am given to understand that GPUs generally sort using radix sort -
resolving the impedance mismatch that prevents someone from using a
non-comparison based sort sure sounds like a lot of work for an
entirely speculative reward.
AFAIK the Thrust library uses radix sort if the keys you are sorting are
POD data comparable with the "<" operator; otherwise it does a
comparison-based sort using the operator provided.
http://docs.thrust.googlecode.com/hg/modules.html
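To illustrate the two paths, a minimal sketch assuming only what the Thrust
documentation describes: plain int keys with the default "<" take the radix
path, while a user-supplied comparator forces the comparison-based path.

#include <thrust/device_vector.h>
#include <thrust/sort.h>

/* Custom ordering: compare by absolute value. */
struct abs_less {
    __host__ __device__ bool operator()(int a, int b) const {
        int aa = a < 0 ? -a : a;
        int bb = b < 0 ? -b : b;
        return aa < bb;
    }
};

int main() {
    thrust::device_vector<int> d(4);
    d[0] = -7; d[1] = 3; d[2] = -1; d[3] = 5;

    thrust::sort(d.begin(), d.end());             /* POD key, "<": radix path */
    thrust::sort(d.begin(), d.end(), abs_less()); /* custom comparator:
                                                     comparison-based path */
    return 0;
}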
I'm not saying that the non-availability of pluggable sort completely
holds people back; I'm saying that it would simplify the process now
and in the future. Of course, that's my opinion.
Someone who cannot understand tuplesort, which is not all that
complicated, has no business trying to build GPU sorting into
Postgres.
That sounds a bit harsh. I'm one of those, indeed; I haven't looked into the
details, not having enough time for it. At work we do GPU computing (not
the sort type of stuff), and given that I'm a Postgres enthusiast I
asked myself: "my server is able to sort around 500 million integers per
second; if postgres were able to do that as well it would be very nice".
What can I say? Sorry for my thoughts.
I had a patch committed a few hours ago that almost included the
capability of assigning an alternative sorting function, but only one
with the exact same signature as my variant of qsort_arg. pg_qsort
isn't used to sort tuples at all, by the way.
Then I did look in the wrong direction. Thank you for pointing that out.
Threading building blocks is not going to form the basis of any novel
sorting implementation, because comparators in general are not thread
safe, and it isn't available on all the platforms we support, and
because of how longjmp interacts with C++ stack unwinding and so on
and so on. Now, you could introduce some kind of parallelism into
sorting integers and floats, but that's an awful lot of work for a
marginal reward.
TBB was just an example that came to my mind.
What do you mean by "you could introduce some kind of parallelism"?
As far as I know, any algorithm using divide and conquer can be
parallelized.
Regards
Gaetano Mendola
On 15 February 2012 22:54, Gaetano Mendola <mendola@gmail.com> wrote:
That sounds a bit harsh. I'm one of those, indeed; I haven't looked into the
details, not having enough time for it. At work we do GPU computing (not
the sort type of stuff), and given that I'm a Postgres enthusiast I
asked myself: "my server is able to sort around 500 million integers per
second; if postgres were able to do that as well it would be very nice".
What can I say? Sorry for my thoughts.
I'm not trying to sound harsh.
The only reason that my patch *nearly* had support for this was
because the implementation that we nearly went with would have only
needed another couple of lines of code to support it. It very probably
wouldn't have turned out to have been useful for any novel sorting
idea, and was really only intended to be used to support user-defined
full sorting specialisations. That didn't end up making the cut.
My point is that whatever is holding back the development of a useful
prototype here, it definitely isn't the lack of an existing API. We
don't know what such an API should look like, and just how invasive it
needs to be. More importantly, it remains to be seen how useful this
idea is in the real world - we don't have so much as a synthetic test
case with a single client, as far as I'm aware.
I'd encourage the OP to share his work on github or something along those lines.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Radix sorting can be used for any data type, if you create a callback that
provides the most significant bits in "width" buckets. At any rate, I can't
imagine why anyone would want to complain about sorting 40 times faster than
before, considering the amount of time databases spend ordering data.
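One common way to get there without a per-pass callback is to map each key
once to an unsigned integer whose unsigned order matches the type's order,
then radix-sort those keys; a sketch for float keys using Thrust follows. The
bit trick is the standard IEEE-754 mapping, not something from this thread.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <cstdint>
#include <cstring>

/* Map a float to a uint32_t whose unsigned order equals the float's order:
   flip all bits of negatives, flip only the sign bit of non-negatives. */
struct float_to_ordered_u32 {
    __host__ __device__ uint32_t operator()(float f) const {
        uint32_t u;
        memcpy(&u, &f, sizeof(u));
        return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
    }
};

int main() {
    thrust::device_vector<float> values(4);
    values[0] = -2.5f; values[1] = 0.5f; values[2] = -0.25f; values[3] = 7.0f;

    /* Build radix-friendly keys, then sort the payload by those keys. */
    thrust::device_vector<uint32_t> keys(values.size());
    thrust::transform(values.begin(), values.end(), keys.begin(),
                      float_to_ordered_u32());
    thrust::sort_by_key(keys.begin(), keys.end(), values.begin());
    return 0;
}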
I have a Cuda card in this machine (NVIDIA GeForce GTX 460) and I would not
mind it a bit if my database "ORDER BY" clause suddenly started running ten
times faster than before when I am dealing with a huge volume of data.
There have been other experiments along these lines, such as:
"GPU-based Sorting in PostgreSQL" - Naju Mancheril, School of Computer
Science, Carnegie Mellon University
www.cs.virginia.edu/~skadron/Papers/bakkum_sqlite_gpgpu10.pdf (This is for
SQLite, but the grammar of SQLite is almost a pure subset of PostgreSQL,
including things like vacuum...)
http://wiki.postgresql.org/images/6/65/Pgopencl.pdf
http://dl.acm.org/citation.cfm?id=1807207
http://www.scribd.com/doc/51484335/PostgreSQL-OpenCL-Procedural-Language-pgEast-March-2011
See also
http://highscalability.com/scaling-postgresql-using-cuda