Multi CPU Queries - Feedback and/or suggestions wanted!
Hi All,
we would like to start some work on improving the performance of
PostgreSQL in a multi-CPU environment. Dano Vojtek is student at the
Faculty of Mathematics and Physics of Charles university in Prague
(http://www.mff.cuni.cz) and he is going to cover this topic in his
master thesis. He is going to do some investigation in the methods and
write down the possibilities and then he is going to implement something
from that for PostgreSQL.
We want to come out with a serious proposal for this work after
collecting the feedback/opinions and doing the more serious investigation.
Topics that seem to be of interest and most of them were already
discussed at developers meeting in Ottawa are
1.) parallel sorts
2.) parallel query execution
3.) asynchronous I/O
4.) parallel COPY
5.) parallel pg_dump
6.) using threads for parallel processing
A scaling with increasing number of CPUs in 1.) and 2.) will face with
the I/O bottleneck at some point and the benefit gained here should be
nearly the same as for 3.) - the OS or disk could do a better job while
scheduling multiple reads from the disk for the same query at the same time.
1.)
More merges could be executed on different CPUs. However, one N-way
merge on one CPU is probably better than two N/2-way merges on 2 CPUs
while sharing the limit of work_mem together for these. This is specific
and separate from 2.) or 3.) and if something implemented here it could
probably share just the parallel infrastructure code.
========
2.)
Different subtrees (or nodes) of the plan could be executed in parallel
on different CPUs and the results of this subtrees could be requested
either synchronously or asynchronously.
========
3.)
The simplest possible way is to change the scan nodes that they will
send out the asynchronous I/O requests for the next blocks before they
manage to run out of tuples in the block they are going through. The
more advanced way would arise just by implementing 2.) which will then
lead to different scan nodes to be executed on different CPUs at the
same time.
========
4.) and 5.)
We do not want to focus here, since there are on-going projects already.
========
6.)
Currently, threads are not used in PostgreSQL (except some cases for
Windows OS). Generally using them would bring some problems
a) different thread implementations on different OSes
b) crash of the whole process if the problem happens in one thread.
Backends are isolated and the problem in one backend leads to the
graceful shut down of other backends.
c) synchronization problems
* a) seem just to be more for implementation. Is there any problem with
execution of more threads on any supported OS? Like some planning issue
that all the threads for the same process end up planned on the same
CPU? Or something similar?
* b) is fine with using more threads for processing the same query in
the same backend - if one crashes others could do the graceful shutdown.
* c) does not have to be solved in general because the work of
all the threads will be synchronized and we could expect pretty well
which data are being accessed by which thread. The memory allocation
have to be adjusted to be thread safe and should not affect the
performance (Is different memory context for different threads
sufficient?). Other common code might need some changes as well.
Possibly, the synchronization/critical section exclusion could be done
in executor and only if needed.
* Using processes instead of threads makes other things more complex
- sharing objects between processes might need much more coding
- more overhead during execution and synchronization
========
It seems to that it makes sense to start working on 2) and 3) and we
would like to think of using more threads for processing the same query
within one backend.
We appreciate feedback, comments and/or suggestions.
Cheers
Julo
On Mon, Oct 20, 2008 at 12:05 PM, Julius Stroffek
<Julius.Stroffek@sun.com>wrote:
Topics that seem to be of interest and most of them were already
discussed at developers meeting in Ottawa are
1.) parallel sorts
2.) parallel query execution
3.) asynchronous I/O
4.) parallel COPY
5.) parallel pg_dump
6.) using threads for parallel processing
[...]
2.)
Different subtrees (or nodes) of the plan could be executed in parallel
on different CPUs and the results of this subtrees could be requested
either synchronously or asynchronously.
I don't see why multiple CPUs can't work on the same node of a plan. For
instance, consider a node involving a scan with an expensive condition, like
UTF-8 string length. If you have four CPUs you can bring to bear, each CPU
could take every fourth page, computing the expensive condition for each
tuple in that page. The results of the scan can be retired asynchronously
to the next node above.
-jwb
There is a problem trying to make Postgres do these things in Parallel.
The backend code isn't thread-safe, so doing a multi-thread
implementation requires quite a bit of work.
Using multiple processes has its own problems: The whole way locking
works equates one process with one transaction (The proc table is one
entry per process). Processes would conflict on locks, deadlocking
themselves, as well as many other problems.
It's all a good idea, but the work is probably far more than you expect.
Async I/O might be easier, if you used pThreads, which is mostly
portable, but not to all platforms. (Yes, they do work on Windows)
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Jeffrey Baker
Sent: 2008-10-20 22:25
To: Julius Stroffek
Cc: pgsql-hackers@postgresql.org; Dano Vojtek
Subject: Re: [HACKERS] Multi CPU Queries - Feedback and/or suggestions
wanted!
On Mon, Oct 20, 2008 at 12:05 PM, Julius Stroffek
<Julius.Stroffek@sun.com> wrote:
Topics that seem to be of interest and most of them were already
discussed at developers meeting in Ottawa are
1.) parallel sorts
2.) parallel query execution
3.) asynchronous I/O
4.) parallel COPY
5.) parallel pg_dump
6.) using threads for parallel processing
[...]
2.)
Different subtrees (or nodes) of the plan could be executed in
parallel
on different CPUs and the results of this subtrees could be
requested
either synchronously or asynchronously.
I don't see why multiple CPUs can't work on the same node of a plan.
For instance, consider a node involving a scan with an expensive
condition, like UTF-8 string length. If you have four CPUs you can
bring to bear, each CPU could take every fourth page, computing the
expensive condition for each tuple in that page. The results of the
scan can be retired asynchronously to the next node above.
-jwb
Hi Jeffrey,
thank you for the suggestion. Yes, they potentially can, we'll consider
this.
Julo
Jeffrey Baker wrote:
Show quoted text
I don't see why multiple CPUs can't work on the same node of a plan.
For instance, consider a node involving a scan with an expensive
condition, like UTF-8 string length. If you have four CPUs you can
bring to bear, each CPU could take every fourth page, computing the
expensive condition for each tuple in that page. The results of the
scan can be retired asynchronously to the next node above.
-jwb
I can confirm that bringing Postgres code to multi-thread implementation
requires quite a bit of ground work. I have been working for a long
while
with a Postgres 7.* fork that uses pthreads rather than processes.
The effort
to make all the subsystems thread safe took some time and touched
almost every section of the codebase.
I recently spent some time trying to optimize for Chip Multi-Threading
systems but focused more on total throughput rather than single query
performance. The biggest wins came from changing some coarse
grained locks in the page buffering system to a finer grained
implementation.
I also tried to improve single query performance by splitting index and
sequential scans into two threads, one to fault in pages and check tuple
visibility and the other for everything else. My success was limited
and
it was hard for me to work the proper costing into the query optimizer
so
that it fired at the right times.
One place that multiple threads really helped was in index building.
My code is poorly commented and the build system is a mess (I am only
building 64bit SPARC for embedding into another app). However, I am
using it in production and source is available if it's of any help.
Myron Scott
On Oct 20, 2008, at 11:28 PM, Chuck McDevitt wrote:
Show quoted text
There is a problem trying to make Postgres do these things in
Parallel.The backend code isn’t thread-safe, so doing a multi-thread
implementation requires quite a bit of work.Using multiple processes has its own problems: The whole way
locking works equates one process with one transaction (The proc
table is one entry per process). Processes would conflict on locks,
deadlocking themselves, as well as many other problems.It’s all a good idea, but the work is probably far more than you
expect.Async I/O might be easier, if you used pThreads, which is mostly
portable, but not to all platforms. (Yes, they do work on Windows)From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org
] On Behalf Of Jeffrey Baker
Sent: 2008-10-20 22:25
To: Julius Stroffek
Cc: pgsql-hackers@postgresql.org; Dano Vojtek
Subject: Re: [HACKERS] Multi CPU Queries - Feedback and/or
suggestions wanted!On Mon, Oct 20, 2008 at 12:05 PM, Julius Stroffek <Julius.Stroffek@sun.com
wrote:
Topics that seem to be of interest and most of them were already
discussed at developers meeting in Ottawa are
1.) parallel sorts
2.) parallel query execution
3.) asynchronous I/O
4.) parallel COPY
5.) parallel pg_dump
6.) using threads for parallel processing
[...]
2.)
Different subtrees (or nodes) of the plan could be executed in
parallel
on different CPUs and the results of this subtrees could be requested
either synchronously or asynchronously.I don't see why multiple CPUs can't work on the same node of a
plan. For instance, consider a node involving a scan with an
expensive condition, like UTF-8 string length. If you have four
CPUs you can bring to bear, each CPU could take every fourth page,
computing the expensive condition for each tuple in that page. The
results of the scan can be retired asynchronously to the next node
above.-jwb
On Mon, 2008-10-20 at 21:05 +0200, Julius Stroffek wrote:
He is going to do some investigation in the methods and
write down the possibilities and then he is going to implement
something from that for PostgreSQL.
When will this work be complete? We are days away from completing main
work on 8.4, so you won't get much discussion on this for a few months
yet. Will it be complete in time for 8.5? Or much earlier even?
Julius, you don't mention what your role is in this. In what sense is
Dano's master's thesis a "we" thing?
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Julius Stroffek wrote:
Hi All,
we would like to start some work on improving the performance of
PostgreSQL in a multi-CPU environment. Dano Vojtek is student at the
Faculty of Mathematics and Physics of Charles university in Prague
(http://www.mff.cuni.cz) and he is going to cover this topic in his
master thesis. He is going to do some investigation in the methods and
write down the possibilities and then he is going to implement something
from that for PostgreSQL.We want to come out with a serious proposal for this work after
collecting the feedback/opinions and doing the more serious investigation.
Exciting stuff, and clearly a direction we need to explore.
Topics that seem to be of interest and most of them were already
discussed at developers meeting in Ottawa are
1.) parallel sorts
2.) parallel query execution
3.) asynchronous I/O
I think the current plan is to use posix_advise() to allow parallel I/O,
rather than async I/O becuase posix_advise() will require fewer code
changes.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
On Thu, 23 Oct 2008, Bruce Momjian wrote:
I think the current plan is to use posix_advise() to allow parallel I/O,
rather than async I/O becuase posix_advise() will require fewer code
changes.
These are not necessarily mutually exclusive designs. fadvise works fine
on Linux, but as far as I know only async I/O works on Solaris. Linux
also has an async I/O library, and it's not clear to me yet whether that
might work even better than the fadvise approach.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Thu, Oct 23, 2008 at 4:53 PM, Greg Smith <gsmith@gregsmith.com> wrote:
I think the current plan is to use posix_advise() to allow parallel I/O,
rather than async I/O becuase posix_advise() will require fewer code
changes.These are not necessarily mutually exclusive designs. fadvise works fine on
Linux, but as far as I know only async I/O works on Solaris. Linux also has
an async I/O library, and it's not clear to me yet whether that might work
even better than the fadvise approach.
fadvise is a kludge. While it will help, it still makes us completely
reliant on the OS. For performance reasons, we should be supporting a
multi-block read directly into shared buffers. IIRC, we currently
have support for rings in the buffer pool, which we could read
directly into. Though, an LRU-based buffer manager design would be
more optimal in this case.
--
Jonah H. Harris, Senior DBA
myYearbook.com
Jonah H. Harris wrote:
On Thu, Oct 23, 2008 at 4:53 PM, Greg Smith <gsmith@gregsmith.com> wrote:
I think the current plan is to use posix_advise() to allow parallel I/O,
rather than async I/O becuase posix_advise() will require fewer code
changes.These are not necessarily mutually exclusive designs. fadvise works fine on
Linux, but as far as I know only async I/O works on Solaris. Linux also has
an async I/O library, and it's not clear to me yet whether that might work
even better than the fadvise approach.fadvise is a kludge. While it will help, it still makes us completely
reliant on the OS. For performance reasons, we should be supporting a
multi-block read directly into shared buffers. IIRC, we currently
have support for rings in the buffer pool, which we could read
directly into. Though, an LRU-based buffer manager design would be
more optimal in this case.
True, it is a kludge but if it gives us 95% of the benfit with 10% of
the code, it is a win.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
I couldn't get async I/O to work on Linux. That is it "worked" but
performed the same as reading one block at a time. On solaris the
situation is reversed.
In what way is fadvise a kludge?
greg
On 24 Oct 2008, at 01:44 AM, Bruce Momjian <bruce@momjian.us> wrote:
Show quoted text
Jonah H. Harris wrote:
On Thu, Oct 23, 2008 at 4:53 PM, Greg Smith <gsmith@gregsmith.com>
wrote:I think the current plan is to use posix_advise() to allow
parallel I/O,
rather than async I/O becuase posix_advise() will require fewer
code
changes.These are not necessarily mutually exclusive designs. fadvise
works fine on
Linux, but as far as I know only async I/O works on Solaris.
Linux also has
an async I/O library, and it's not clear to me yet whether that
might work
even better than the fadvise approach.fadvise is a kludge. While it will help, it still makes us
completely
reliant on the OS. For performance reasons, we should be
supporting a
multi-block read directly into shared buffers. IIRC, we currently
have support for rings in the buffer pool, which we could read
directly into. Though, an LRU-based buffer manager design would be
more optimal in this case.True, it is a kludge but if it gives us 95% of the benfit with 10% of
the code, it is a win.--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com+ If your life is a hard drive, Christ can be your backup. +
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Greg Stark wrote:
I couldn't get async I/O to work on Linux. That is it "worked" but
performed the same as reading one block at a time. On solaris the
situation is reversed.In what way is fadvise a kludge?
I think he is saying AIO gives us more flexibility, but I am unsure we
need it.
---------------------------------------------------------------------------
greg
On 24 Oct 2008, at 01:44 AM, Bruce Momjian <bruce@momjian.us> wrote:
Jonah H. Harris wrote:
On Thu, Oct 23, 2008 at 4:53 PM, Greg Smith <gsmith@gregsmith.com>
wrote:I think the current plan is to use posix_advise() to allow
parallel I/O,
rather than async I/O becuase posix_advise() will require fewer
code
changes.These are not necessarily mutually exclusive designs. fadvise
works fine on
Linux, but as far as I know only async I/O works on Solaris.
Linux also has
an async I/O library, and it's not clear to me yet whether that
might work
even better than the fadvise approach.fadvise is a kludge. While it will help, it still makes us
completely
reliant on the OS. For performance reasons, we should be
supporting a
multi-block read directly into shared buffers. IIRC, we currently
have support for rings in the buffer pool, which we could read
directly into. Though, an LRU-based buffer manager design would be
more optimal in this case.True, it is a kludge but if it gives us 95% of the benfit with 10% of
the code, it is a win.--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com+ If your life is a hard drive, Christ can be your backup. +
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
On Thu, Oct 23, 2008 at 8:44 PM, Bruce Momjian <bruce@momjian.us> wrote:
True, it is a kludge but if it gives us 95% of the benfit with 10% of
the code, it is a win.
I'd say, optimistically, maybe 30-45% the benefit over a proper
multi-block read using O_DIRECT.
--
Jonah H. Harris, Senior DBA
myYearbook.com
On Thu, Oct 23, 2008 at 10:36 PM, Greg Stark
<greg.stark@enterprisedb.com> wrote:
I couldn't get async I/O to work on Linux. That is it "worked" but performed
the same as reading one block at a time. On solaris the situation is
reversed.
Hmm, then obviously you did something wrong, because my tests showed
it quite well. Pull the source to iozone or fio.
In what way is fadvise a kludge?
non-portable, requires more user-to-system CPU, ... need I go on?
--
Jonah H. Harris, Senior DBA
myYearbook.com
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
On Thu, Oct 23, 2008 at 10:36 PM, Greg Stark
In what way is fadvise a kludge?
non-portable, requires more user-to-system CPU, ... need I go on?
I'd be interested to know which of these proposals you claim *is*
portable. The single biggest reason to reject 'em all is that
they aren't.
regards, tom lane
On Fri, Oct 24, 2008 at 12:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
non-portable, requires more user-to-system CPU, ... need I go on?
I'd be interested to know which of these proposals you claim *is*
portable. The single biggest reason to reject 'em all is that
they aren't.
Yes, that was bad wording on my part. What I mean to say was
unpredictable. Different OSes and filesystems handle fadvise
differently (or not at all), which makes any claim to performance gain
configuration-dependent. My preferred method, using O_DIRECT and
fetching directly into shared buffers, is not without its issues or
challenges as well. However, by abstracting the multi-block read
interface, we could use more optimal calls depending on the OS.
Having done a bit of research and testing in this area (AIO and buffer
management), I don't see any easy solution. fadvise will work on some
systems and will likely give some gain on them, but won't work for
everyone. The alternative is to abstract prefetching and allow
platform-specific code, which we rarely do. While we could build an
abstract prefetch interface and simply use fadvise for it now (rather
than OS-specific code), I don't see an easy win in any case.
--
Jonah H. Harris, Senior DBA
myYearbook.com
On 24 Oct 2008, at 04:31 AM, "Jonah H. Harris"
<jonah.harris@gmail.com> wrote:
On Thu, Oct 23, 2008 at 10:36 PM, Greg Stark
<greg.stark@enterprisedb.com> wrote:I couldn't get async I/O to work on Linux. That is it "worked" but
performed
the same as reading one block at a time. On solaris the situation is
reversed.Hmm, then obviously you did something wrong, because my tests showed
it quite well. Pull the source to iozone or fio.
I posted the source, feel free to point out what I did wrong. It did
work on solaris with and without o_direct so I didn't think it was a
bug in my code.
In what way is fadvise a kludge?
non-portable, requires more user-to-system CPU, ... need I go on?
Well it's just as portable, they're both specified by posix. Actually
async I/o is in the real-time extensions so one could argue it's less
portable. Also before posix_fadvise there was plain old fadvise so
it's portable to older platforms too whereas async I/o isn't.
Posix_fadvise does require two syscalls and two trips to the buffer
manager. But that doesn't really make it a kludge if the resulting
code is cleaner than the async I/o code would be. To use async I/o we
would have to pin all the buffers we're reading which would be quite a
lot of code changes.
I did ask for feedback on precisely this point of whether two trips to
the buffer manager was a problem. It would have been nice to get the
feedback 6 months ago when I posted it instead of now two weeks before
feature freeze.
Based on what? I did test this and posted the data. The results I
posted showed that posix_fadvise on Linux performed nearly as well on
Linux as async I/O on Solaris on identical hardware.
More importantly it scaled with the number if drives. A 15 drive array
gets about 15x the performance of a 1 drive array if enough read-ahead
is done. Plus an extra boost if the input wasn't already sorted which
presumably reflects the better i/o ordering.
--
greg
On 24 Oct 2008, at 04:29 AM, "Jonah H. Harris"
<jonah.harris@gmail.com> wrote:
Show quoted text
On Thu, Oct 23, 2008 at 8:44 PM, Bruce Momjian <bruce@momjian.us>
wrote:True, it is a kludge but if it gives us 95% of the benfit with 10% of
the code, it is a win.I'd say, optimistically, maybe 30-45% the benefit over a proper
multi-block read using O_DIRECT.--
Jonah H. Harris, Senior DBA
myYearbook.com--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
We did discuss this in Ottawa and I beleive your comment was "the hint
is in the name" referring to posix_fadvise.
In any case both aio and posix_fadvise are specified by posix so I
don't see either as a problem on that front.
I don't think we can ignore any longer that we effectively can't use
raid arrays with postgres. If you have many concurrent queries or
restrict yourself to sequential scans you're ok but if you're doing
data warehousing you're going to be pretty disappointed to see your
shiny raid array performing like a single drive.
greg
On 24 Oct 2008, at 05:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Show quoted text
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
On Thu, Oct 23, 2008 at 10:36 PM, Greg Stark
In what way is fadvise a kludge?
non-portable, requires more user-to-system CPU, ... need I go on?
I'd be interested to know which of these proposals you claim *is*
portable. The single biggest reason to reject 'em all is that
they aren't.regards, tom lane
Jonah H. Harris wrote:
fadvise is a kludge.
I don't think it's a kludge at all. posix_fadvise() is a pretty nice and
clean interface to hint the kernel what pages you're going to access in
the near future. I can't immediately come up with a cleaner interface to
do that.
Compared to async I/O, it's helluva lot simpler to add a few
posix_fadvise() calls to an application, than switch to a completely
different paradigm. And while posix_fadvise() is just a hint, allowing
the OS to prioritize accordingly, all async I/O requests look the same.
While it will help, it still makes us completely
reliant on the OS.
That's not a bad thing in my opinion. The OS knows the I/O hardware,
disk layout, utilization, and so forth, and is in a much better position
to do I/O scheduling than a user process. The only advantage a user
process has is that it knows better what pages it's going to need, and
posix_fadvise() is a good interface to let the user process tell the
kernel that.
IIRC, we currently have support for rings in the buffer pool, which we could read
directly into.
The rings won't help you a bit. It's just a different way to choose
victim buffers.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Jonah H. Harris wrote:
On Thu, Oct 23, 2008 at 8:44 PM, Bruce Momjian <bruce@momjian.us> wrote:
True, it is a kludge but if it gives us 95% of the benfit with 10% of
the code, it is a win.I'd say, optimistically, maybe 30-45% the benefit over a proper
multi-block read using O_DIRECT.
Let's try to focus. We're not talking about using O_DIRECT, we're
talking about using asynchronous I/O or posix_fadvise(). And without
more details on what you mean by benefit, under what circumstances, any
numbers like that is just unmeasurable handwaving.
In terms of getting the RAID array busy, in Greg's tests posix_fadvise()
on Linux worked just as well as async I/O works on Solaris. So it
doesn't seem like there's any inherent performance advantage in the
async I/O interface over posix_fadvise() + read(),
There is differences between different OS implementations of the
interfaces. But we're developing software for the future, and for a wide
range of platforms, and I'm sure operating systems will develop as well.
The decision should not be made on what is the fastest interface on a
given operating system in 2008.
Async I/O might have a small potential edge on CPU usage, because less
system calls are needed. However, let me remind you all that we're
talking about how to utilize RAID array to do physical, random, I/O as
fast as possible. IOW, the bottleneck is I/O, by definition. The CPU
efficiency of the kernel interface to initiate the I/O is insignificant,
until we reach a large enough random read throughput to saturate the
CPU, and even then there's probably more significant CPU savings to be
made elsewhere.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, 2008-10-24 at 00:52 -0400, Jonah H. Harris wrote:
While we could build an
abstract prefetch interface and simply use fadvise for it now (rather
than OS-specific code), I don't see an easy win in any case.
When building an abstract interface, always use at least two
implementations (I guess that would be fadvise on linux and AIO on
solaris in this case). You are much more likely to get the interface
right this way.
--------------
Hannu
On Fri, Oct 24, 2008 at 7:59 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
On Fri, 2008-10-24 at 00:52 -0400, Jonah H. Harris wrote:
While we could build an
abstract prefetch interface and simply use fadvise for it now (rather
than OS-specific code), I don't see an easy win in any case.When building an abstract interface, always use at least two
implementations (I guess that would be fadvise on linux and AIO on
solaris in this case). You are much more likely to get the interface
right this way.
I agree, I just wasn't sure as to whether Greg's patch supported both methods.
--
Jonah H. Harris, Senior DBA
myYearbook.com
Jonah H. Harris wrote:
On Fri, Oct 24, 2008 at 7:59 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
On Fri, 2008-10-24 at 00:52 -0400, Jonah H. Harris wrote:
While we could build an
abstract prefetch interface and simply use fadvise for it now (rather
than OS-specific code), I don't see an easy win in any case.When building an abstract interface, always use at least two
implementations (I guess that would be fadvise on linux and AIO on
solaris in this case). You are much more likely to get the interface
right this way.I agree, I just wasn't sure as to whether Greg's patch supported both methods.
It does not, and probably will not for the near future; we can only
hope Solaris suports posix_fadvise() at some point.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
I thought about how to support both and ran into probblems that would
make the resulting solutions quite complex.
In the libaio view of the world you initiate io and either get a
callback or call another syscall to test if it's complete. Either
approach has problems for postgres. If the process that initiated io
is in the middle of a long query it might takr a long time ot even
never get back to complete the io. The callbacks use threads...
And polling for completion has the problem that another process could
be waiting on the io and can't issue a read as long as the first
process has the buffer locked and io in progress. I think aio makes a
lot more sense if you're using threads so you can start a thread to
wait for the io to complete.
Actually I think it might be doable with a lot of work but I'm worried
that it would be a lot of extra complexity even when you're not using
it. The current patch doesn't change anything when you're not using it
and actually is quite simple.
greg
On 24 Oct 2008, at 03:18 PM, Bruce Momjian <bruce@momjian.us> wrote:
Show quoted text
Jonah H. Harris wrote:
On Fri, Oct 24, 2008 at 7:59 AM, Hannu Krosing
<hannu@2ndquadrant.com> wrote:On Fri, 2008-10-24 at 00:52 -0400, Jonah H. Harris wrote:
While we could build an
abstract prefetch interface and simply use fadvise for it now
(rather
than OS-specific code), I don't see an easy win in any case.When building an abstract interface, always use at least two
implementations (I guess that would be fadvise on linux and AIO on
solaris in this case). You are much more likely to get the interface
right this way.I agree, I just wasn't sure as to whether Greg's patch supported
both methods.It does not, and probably will not for the near future; we can only
hope Solaris suports posix_fadvise() at some point.--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com+ If your life is a hard drive, Christ can be your backup. +
Also keep in mind that solaris is open source these days. If someone
wants it they could always go ahead and add the feature ...
greg
On 24 Oct 2008, at 03:18 PM, Bruce Momjian <bruce@momjian.us> wrote:
Show quoted text
Jonah H. Harris wrote:
On Fri, Oct 24, 2008 at 7:59 AM, Hannu Krosing
<hannu@2ndquadrant.com> wrote:On Fri, 2008-10-24 at 00:52 -0400, Jonah H. Harris wrote:
While we could build an
abstract prefetch interface and simply use fadvise for it now
(rather
than OS-specific code), I don't see an easy win in any case.When building an abstract interface, always use at least two
implementations (I guess that would be fadvise on linux and AIO on
solaris in this case). You are much more likely to get the interface
right this way.I agree, I just wasn't sure as to whether Greg's patch supported
both methods.It does not, and probably will not for the near future; we can only
hope Solaris suports posix_fadvise() at some point.--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com+ If your life is a hard drive, Christ can be your backup. +
* Greg Stark <greg.stark@enterprisedb.com> [081024 10:48]:
I thought about how to support both and ran into probblems that would
make the resulting solutions quite complex.In the libaio view of the world you initiate io and either get a
callback or call another syscall to test if it's complete. Either
approach has problems for postgres. If the process that initiated io
is in the middle of a long query it might takr a long time ot even
never get back to complete the io. The callbacks use threads...And polling for completion has the problem that another process could
be waiting on the io and can't issue a read as long as the first
process has the buffer locked and io in progress. I think aio makes a
lot more sense if you're using threads so you can start a thread to
wait for the io to complete.Actually I think it might be doable with a lot of work but I'm worried
that it would be a lot of extra complexity even when you're not using
it. The current patch doesn't change anything when you're not using it
and actually is quite simple.
In the Solaris async IO, are you bound by direct IO? Does the OS page-cache
still get primed by async reads? If so, how about starting async IO
into a "throwaway" local buffer; treat async IO in the same way as
fadvise, a "pre-load the OS page cache so the real read is quick".
Sure, I understand it's not the "perfect model", but it I don't see
PostgreSQL being refactored enough to have a pure async model happening
any time in the near future...
--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.
Hi Simon,
He is going to do some investigation in the methods and
write down the possibilities and then he is going to implement
something from that for PostgreSQL.When will this work be complete? We are days away from completing main
work on 8.4, so you won't get much discussion on this for a few months
yet. Will it be complete in time for 8.5? Or much earlier even?
The first guess is that the work will be done for 8.6. Dano is supposed
to finish the work and defend his thesis in something a bit more than a
year.
Julius, you don't mention what your role is in this. In what sense is
Dano's master's thesis a "we" thing?
I am Dano's mentor and we have a closed contact with Zdenek as well. We
would like the project to become a "we" thing as another reason why to
work on the project. It seems to be better to research some ideas at the
begging and discuss the stuff during development than just individually
writing some piece of code which could be published afterwards.
Especially, when this area seems to be of interest of more people.
Cheers
Julo
Hi Simon,
He is going to do some investigation in the methods and
write down the possibilities and then he is going to implement
something from that for PostgreSQL.When will this work be complete? We are days away from completing main
work on 8.4, so you won't get much discussion on this for a few months
yet. Will it be complete in time for 8.5? Or much earlier even?The first guess is that the work will be done for 8.6.
Dano is supposed to finish the work and defend his thesis in something
a bit more than a year.
Julius, you don't mention what your role is in this. In what sense is
Dano's master's thesis a "we" thing?I am Dano's mentor and we have a closed contact with Zdenek as well.
We would like the project to become a "we" thing as another reason why
to work on the project.
It seems to be better to research some ideas at the begging and discuss
the stuff
during development than just individually writing some piece of code
which could
be published afterwards. Especially, when this area seems to be of
interest of
more people.
Threads are where future performance is going to come from:
General purpose->
http://www.setup32.com/hardware/cpuchipset/32core-processors-intel-reach
e.php
GPU->
http://wwwx.cs.unc.edu/~lastra/Research/GPU_performance.html
http://www.cs.unc.edu/~geom/GPUSORT/results.html
Database engines that want to exploit the ultimate in performance will
utilize multiple threads of execution.
True, the same thing can be realized by multiple processes, but a
process is more expensive than a thread.
Bruce Momjian wrote:
Greg Stark wrote:
I couldn't get async I/O to work on Linux. That is it "worked" but
performed the same as reading one block at a time. On solaris the
situation is reversed.In what way is fadvise a kludge?
I think he is saying AIO gives us more flexibility, but I am unsure we
need it.
absolutely.
posix_fadvise is easy to implement and i would assume that it takes away
a lot of "guessing" on the OS internals side.
the database usually knows that it is gonna read a lot of data in a
certain way and it cannot be a bad idea to give the kernel a hint here.
especially synchronized seq scans and so on are real winners here as you
stop confusing the kernel with XX concurrent readers on the same file.
this can also be an issue with some controller firmwares and so on.
many thanks,
hans