Multi CPU Queries - Feedback and/or suggestions wanted!
Hi All,
we would like to start some work on improving the performance of
PostgreSQL in a multi-CPU environment. Dano Vojtek is a student at the
Faculty of Mathematics and Physics of Charles University in Prague
(http://www.mff.cuni.cz) and he is going to cover this topic in his
master's thesis. He is going to investigate the methods, write down the
possibilities, and then implement some of them for PostgreSQL.
We want to come out with a serious proposal for this work after
collecting feedback/opinions and doing a more serious investigation.
Topics that seem to be of interest, most of which were already
discussed at the developers meeting in Ottawa, are:
1.) parallel sorts
2.) parallel query execution
3.) asynchronous I/O
4.) parallel COPY
5.) parallel pg_dump
6.) using threads for parallel processing
Scaling with an increasing number of CPUs in 1.) and 2.) will hit the
I/O bottleneck at some point, and the benefit gained there should be
nearly the same as for 3.) - the OS or disk could do a better job
scheduling multiple reads from the disk for the same query at the same time.
1.)
More merges could be executed on different CPUs. However, one N-way
merge on one CPU is probably better than two N/2-way merges on 2 CPUs
that share the work_mem limit between them. This is specific and
separate from 2.) or 3.), and if something is implemented here it could
probably share just the parallel infrastructure code.
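The work_mem trade-off above can be made concrete with a bit of arithmetic. The sketch below is purely illustrative (merge_passes and fan_in are hypothetical names, where fan_in is roughly work_mem divided by the per-run buffer size); it is not tuplesort code:

```c
#include <assert.h>

/* Number of merge passes needed to combine "runs" sorted runs when one
 * merge can read at most "fan_in" runs at a time (fan_in is roughly
 * work_mem / per-run buffer size). */
int merge_passes(int runs, int fan_in)
{
    int passes = 0;

    while (runs > 1)
    {
        runs = (runs + fan_in - 1) / fan_in;    /* ceiling division */
        passes++;
    }
    return passes;
}
```

For example, merge_passes(64, 64) == 1, but splitting the same 64 runs between two workers with half the memory each still leaves two sorted outputs that require a further merge step, which is the overhead described above.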
========
2.)
Different subtrees (or nodes) of the plan could be executed in parallel
on different CPUs, and the results of these subtrees could be requested
either synchronously or asynchronously.
========
3.)
The simplest possible way is to change the scan nodes so that they
send out asynchronous I/O requests for the next blocks before they run
out of tuples in the block they are currently going through. The more
advanced way would arise just by implementing 2.), which would then
lead to different scan nodes being executed on different CPUs at the
same time.
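A minimal sketch of that "simplest possible way" (not actual executor code; the BLCKSZ value, the prefetch_ahead name, and the depth parameter are illustrative assumptions), using posix_fadvise() to hint the kernel about upcoming blocks:

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Hint the kernel that the "depth" blocks following block "blkno" will
 * be needed soon.  posix_fadvise() is only advisory: it returns
 * immediately, and the kernel may start reading ahead in the
 * background.  Returns 0 on success. */
int prefetch_ahead(int fd, long blkno, int depth)
{
    return posix_fadvise(fd, (off_t) (blkno + 1) * BLCKSZ,
                         (off_t) depth * BLCKSZ, POSIX_FADV_WILLNEED);
}
```

A scan would call this while it still has tuples left in the current block, so the kernel can overlap the next reads with tuple processing.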
========
4.) and 5.)
We do not want to focus on these, since there are already ongoing projects.
========
6.)
Currently, threads are not used in PostgreSQL (except in some cases on
Windows). Generally, using them would bring some problems:
a) different thread implementations on different OSes
b) a crash of the whole process if a problem happens in one thread.
Backends are isolated, and a problem in one backend leads to a
graceful shutdown of the other backends.
c) synchronization problems
* a) seems to be mostly an implementation matter. Is there any problem
with executing more threads on any supported OS? Like some scheduling
issue where all the threads of the same process end up scheduled on the
same CPU? Or something similar?
* b) is fine when using more threads to process the same query in the
same backend - if one crashes, the others could do the graceful shutdown.
* c) does not have to be solved in general, because the work of all the
threads will be synchronized and we can predict fairly well which data
are being accessed by which thread. Memory allocation has to be
adjusted to be thread safe without affecting performance (is a
different memory context for each thread sufficient?). Other common
code might need some changes as well. Possibly, the
synchronization/critical section exclusion could be done in the
executor and only where needed.
* Using processes instead of threads makes other things more complex
- sharing objects between processes might need much more coding
- more overhead during execution and synchronization
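On the "different memory context for each thread" question, a minimal sketch of the idea (the ThreadContext type and ctx_* functions are hypothetical, not PostgreSQL's MemoryContext API): each thread bump-allocates from its own private arena, so ordinary allocations need no locking at all, and only arena creation/destruction is coordinated by the parent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One private allocation arena per thread: an allocation is a simple
 * bump of the "used" offset, with no locking needed because no other
 * thread ever touches this arena. */
typedef struct
{
    char   *base;
    size_t  used;
    size_t  size;
} ThreadContext;

ThreadContext *ctx_create(size_t size)
{
    ThreadContext *cx = malloc(sizeof(ThreadContext));

    cx->base = malloc(size);
    cx->used = 0;
    cx->size = size;
    return cx;
}

void *ctx_alloc(ThreadContext *cx, size_t n)
{
    void *p;

    n = (n + 7) & ~(size_t) 7;      /* keep 8-byte alignment */
    if (cx->used + n > cx->size)
        return NULL;                /* a real context would grow here */
    p = cx->base + cx->used;
    cx->used += n;
    return p;
}

void ctx_destroy(ThreadContext *cx)
{
    free(cx->base);                 /* frees every allocation at once */
    free(cx);
}
```

The context-reset-at-once behavior is also what makes cleanup after a failed worker cheap.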
========
It seems to us that it makes sense to start working on 2.) and 3.), and
we would like to consider using more threads for processing the same
query within one backend.
We appreciate feedback, comments and/or suggestions.
Cheers
Julo
On Mon, Oct 20, 2008 at 12:05 PM, Julius Stroffek
<Julius.Stroffek@sun.com> wrote:
Topics that seem to be of interest and most of them were already
discussed at developers meeting in Ottawa are
1.) parallel sorts
2.) parallel query execution
3.) asynchronous I/O
4.) parallel COPY
5.) parallel pg_dump
6.) using threads for parallel processing
[...]
2.)
Different subtrees (or nodes) of the plan could be executed in parallel
on different CPUs and the results of this subtrees could be requested
either synchronously or asynchronously.
I don't see why multiple CPUs can't work on the same node of a plan. For
instance, consider a node involving a scan with an expensive condition, like
UTF-8 string length. If you have four CPUs you can bring to bear, each CPU
could take every fourth page, computing the expensive condition for each
tuple in that page. The results of the scan can be retired asynchronously
to the next node above.
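Jeffrey's page-striping idea can be sketched with pthreads (purely illustrative: pages are stood in for by an int array, and the "expensive condition" by a trivial predicate instead of UTF-8 string length). Worker k takes pages k, k+4, k+8, and so on:

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NWORKERS 4
#define NPAGES   100

int  pages[NPAGES];                 /* stand-in for page contents */
long counts[NWORKERS];              /* per-worker match counts */

static int expensive_condition(int v)
{
    return v % 3 == 0;              /* stand-in for the costly predicate */
}

static void *scan_worker(void *arg)
{
    long k = (intptr_t) arg;

    /* worker k takes every NWORKERS-th page, starting at page k */
    for (long p = k; p < NPAGES; p += NWORKERS)
        if (expensive_condition(pages[p]))
            counts[k]++;
    return NULL;
}

long parallel_scan(void)
{
    pthread_t tid[NWORKERS];
    long total = 0;

    for (long k = 0; k < NWORKERS; k++)
    {
        counts[k] = 0;
        pthread_create(&tid[k], NULL, scan_worker, (void *) (intptr_t) k);
    }
    for (long k = 0; k < NWORKERS; k++)
    {
        pthread_join(tid[k], NULL);
        total += counts[k];         /* results retired to the parent */
    }
    return total;
}
```

Each worker writes only its own counts[] slot, so no locking is needed until the parent gathers the results, which mirrors the "retire asynchronously to the node above" step.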
-jwb
There is a problem trying to make Postgres do these things in Parallel.
The backend code isn't thread-safe, so doing a multi-thread
implementation requires quite a bit of work.
Using multiple processes has its own problems: The whole way locking
works equates one process with one transaction (The proc table is one
entry per process). Processes would conflict on locks, deadlocking
themselves, as well as many other problems.
It's all a good idea, but the work is probably far more than you expect.
Async I/O might be easier, if you used pThreads, which is mostly
portable, but not to all platforms. (Yes, they do work on Windows)
[...]
Hi Jeffrey,
thank you for the suggestion. Yes, they potentially can, we'll consider
this.
Julo
Jeffrey Baker wrote:
[...]
I can confirm that bringing the Postgres code to a multi-thread
implementation requires quite a bit of ground work. I have been working
for a long while with a Postgres 7.* fork that uses pthreads rather
than processes. The effort to make all the subsystems thread safe took
some time and touched almost every section of the codebase.
I recently spent some time trying to optimize for Chip Multi-Threading
systems but focused more on total throughput rather than single query
performance. The biggest wins came from changing some coarse
grained locks in the page buffering system to a finer grained
implementation.
I also tried to improve single query performance by splitting index and
sequential scans into two threads, one to fault in pages and check tuple
visibility and the other for everything else. My success was limited,
and it was hard for me to work the proper costing into the query
optimizer so that it fired at the right times.
One place that multiple threads really helped was in index building.
My code is poorly commented and the build system is a mess (I am only
building 64bit SPARC for embedding into another app). However, I am
using it in production and source is available if it's of any help.
Myron Scott
On Oct 20, 2008, at 11:28 PM, Chuck McDevitt wrote:
[...]
On Mon, 2008-10-20 at 21:05 +0200, Julius Stroffek wrote:
He is going to do some investigation in the methods and
write down the possibilities and then he is going to implement
something from that for PostgreSQL.
When will this work be complete? We are days away from completing main
work on 8.4, so you won't get much discussion on this for a few months
yet. Will it be complete in time for 8.5? Or much earlier even?
Julius, you don't mention what your role is in this. In what sense is
Dano's master's thesis a "we" thing?
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Julius Stroffek wrote:
[...]
Exciting stuff, and clearly a direction we need to explore.
Topics that seem to be of interest and most of them were already
discussed at developers meeting in Ottawa are
1.) parallel sorts
2.) parallel query execution
3.) asynchronous I/O
I think the current plan is to use posix_fadvise() to allow parallel
I/O, rather than async I/O, because posix_fadvise() will require fewer
code changes.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
On Thu, 23 Oct 2008, Bruce Momjian wrote:
I think the current plan is to use posix_fadvise() to allow parallel
I/O, rather than async I/O, because posix_fadvise() will require fewer
code changes.
These are not necessarily mutually exclusive designs. fadvise works fine
on Linux, but as far as I know only async I/O works on Solaris. Linux
also has an async I/O library, and it's not clear to me yet whether that
might work even better than the fadvise approach.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Thu, Oct 23, 2008 at 4:53 PM, Greg Smith <gsmith@gregsmith.com> wrote:
[...]
fadvise is a kludge. While it will help, it still makes us completely
reliant on the OS. For performance reasons, we should be supporting a
multi-block read directly into shared buffers. IIRC, we currently
have support for rings in the buffer pool, which we could read
directly into. Though, an LRU-based buffer manager design would be
more optimal in this case.
--
Jonah H. Harris, Senior DBA
myYearbook.com
Jonah H. Harris wrote:
[...]
fadvise is a kludge. While it will help, it still makes us completely
reliant on the OS. [...]
True, it is a kludge, but if it gives us 95% of the benefit with 10% of
the code, it is a win.
I couldn't get async I/O to work on Linux. That is, it "worked" but
performed the same as reading one block at a time. On Solaris the
situation is reversed.
In what way is fadvise a kludge?
greg
On 24 Oct 2008, at 01:44 AM, Bruce Momjian <bruce@momjian.us> wrote:
[...]
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Greg Stark wrote:
I couldn't get async I/O to work on Linux. That is, it "worked" but
performed the same as reading one block at a time. On Solaris the
situation is reversed.
In what way is fadvise a kludge?
I think he is saying AIO gives us more flexibility, but I am unsure we
need it.
On Thu, Oct 23, 2008 at 8:44 PM, Bruce Momjian <bruce@momjian.us> wrote:
True, it is a kludge, but if it gives us 95% of the benefit with 10% of
the code, it is a win.
I'd say, optimistically, maybe 30-45% of the benefit of a proper
multi-block read using O_DIRECT.
--
Jonah H. Harris, Senior DBA
myYearbook.com
On Thu, Oct 23, 2008 at 10:36 PM, Greg Stark
<greg.stark@enterprisedb.com> wrote:
I couldn't get async I/O to work on Linux. That is, it "worked" but
performed the same as reading one block at a time. On Solaris the
situation is reversed.
Hmm, then obviously you did something wrong, because my tests showed
it quite well. Pull the source to iozone or fio.
In what way is fadvise a kludge?
non-portable, requires more user-to-system CPU, ... need I go on?
--
Jonah H. Harris, Senior DBA
myYearbook.com
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
On Thu, Oct 23, 2008 at 10:36 PM, Greg Stark
In what way is fadvise a kludge?
non-portable, requires more user-to-system CPU, ... need I go on?
I'd be interested to know which of these proposals you claim *is*
portable. The single biggest reason to reject 'em all is that
they aren't.
regards, tom lane
On Fri, Oct 24, 2008 at 12:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
non-portable, requires more user-to-system CPU, ... need I go on?
I'd be interested to know which of these proposals you claim *is*
portable. The single biggest reason to reject 'em all is that
they aren't.
Yes, that was bad wording on my part. What I meant to say was
"unpredictable". Different OSes and filesystems handle fadvise
differently (or not at all), which makes any claim of a performance
gain configuration-dependent. My preferred method, using O_DIRECT and
fetching directly into shared buffers, is not without its own issues
and challenges. However, by abstracting the multi-block read interface,
we could use more optimal calls depending on the OS.
Having done a bit of research and testing in this area (AIO and buffer
management), I don't see any easy solution. fadvise will work on some
systems and will likely give some gain on them, but won't work for
everyone. The alternative is to abstract prefetching and allow
platform-specific code, which we rarely do. While we could build an
abstract prefetch interface and simply use fadvise for it now (rather
than OS-specific code), I don't see an easy win in any case.
--
Jonah H. Harris, Senior DBA
myYearbook.com
On 24 Oct 2008, at 04:31 AM, "Jonah H. Harris"
<jonah.harris@gmail.com> wrote:
On Thu, Oct 23, 2008 at 10:36 PM, Greg Stark
<greg.stark@enterprisedb.com> wrote:
I couldn't get async I/O to work on Linux. That is, it "worked" but
performed the same as reading one block at a time. On Solaris the
situation is reversed.
Hmm, then obviously you did something wrong, because my tests showed
it quite well. Pull the source to iozone or fio.
I posted the source, feel free to point out what I did wrong. It did
work on Solaris with and without O_DIRECT, so I didn't think it was a
bug in my code.
In what way is fadvise a kludge?
non-portable, requires more user-to-system CPU, ... need I go on?
Well, it's just as portable; they're both specified by POSIX. Actually,
async I/O is in the real-time extensions, so one could argue it's less
portable. Also, before posix_fadvise there was plain old fadvise, so
it's portable to older platforms too, whereas async I/O isn't.
posix_fadvise does require two syscalls and two trips to the buffer
manager. But that doesn't really make it a kludge if the resulting
code is cleaner than the async I/O code would be. To use async I/O we
would have to pin all the buffers we're reading, which would be quite a
lot of code changes.
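The two trips Greg describes can be sketched as a pair of steps per block (an illustrative helper, not the actual patch; the BLCKSZ value and function name are assumptions): the first trip issues the advisory hint for the following block, the second does the blocking read of the current one.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Read block "blkno" into "buf" and, in the same visit, hint the
 * kernel about the following block.  Two system calls per block: one
 * advisory (cheap, returns immediately) and one actual read.  Returns
 * the number of bytes read. */
ssize_t read_block_with_hint(int fd, long blkno, char *buf)
{
    /* trip 1: hint - kernel may start the next read in the background */
    posix_fadvise(fd, (off_t) (blkno + 1) * BLCKSZ, BLCKSZ,
                  POSIX_FADV_WILLNEED);
    /* trip 2: the real read of the current block */
    return pread(fd, buf, BLCKSZ, (off_t) blkno * BLCKSZ);
}
```

The extra syscall per block is the cost being debated; the win is that by the time the next pread() is issued, the data is often already in the OS cache.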
I did ask for feedback on precisely this point of whether two trips to
the buffer manager was a problem. It would have been nice to get the
feedback 6 months ago when I posted it instead of now two weeks before
feature freeze.
Based on what? I did test this and posted the data. The results I
posted showed that posix_fadvise on Linux performed nearly as well as
async I/O on Solaris on identical hardware.
More importantly, it scaled with the number of drives. A 15-drive array
gets about 15x the performance of a 1-drive array if enough read-ahead
is done. Plus an extra boost if the input wasn't already sorted, which
presumably reflects the better I/O ordering.
--
greg
On 24 Oct 2008, at 04:29 AM, "Jonah H. Harris"
<jonah.harris@gmail.com> wrote:
[...]
We did discuss this in Ottawa, and I believe your comment was "the hint
is in the name", referring to posix_fadvise.
In any case, both AIO and posix_fadvise are specified by POSIX, so I
don't see either as a problem on that front.
I don't think we can ignore any longer that we effectively can't use
RAID arrays with Postgres. If you have many concurrent queries or
restrict yourself to sequential scans you're OK, but if you're doing
data warehousing you're going to be pretty disappointed to see your
shiny RAID array performing like a single drive.
greg
On 24 Oct 2008, at 05:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
[...]
Jonah H. Harris wrote:
fadvise is a kludge.
I don't think it's a kludge at all. posix_fadvise() is a pretty nice and
clean interface for hinting to the kernel which pages you're going to
access in the near future. I can't immediately come up with a cleaner
interface to do that.
Compared to async I/O, it's a helluva lot simpler to add a few
posix_fadvise() calls to an application than to switch to a completely
different paradigm. And while posix_fadvise() is just a hint, allowing
the OS to prioritize accordingly, all async I/O requests look the same.
While it will help, it still makes us completely
reliant on the OS.
That's not a bad thing in my opinion. The OS knows the I/O hardware,
disk layout, utilization, and so forth, and is in a much better position
to do I/O scheduling than a user process. The only advantage a user
process has is that it knows better what pages it's going to need, and
posix_fadvise() is a good interface to let the user process tell the
kernel that.
IIRC, we currently have support for rings in the buffer pool, which we could read
directly into.
The rings won't help you a bit. It's just a different way to choose
victim buffers.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com