Support Parallel Query Execution in Executor
I have written some experimental code for doing a master-slave seqscan in
PostgreSQL. During the work, I came to feel we have enough infrastructure to
support parallel query execution.
What I did was add a new node, PARA, and plug it in above the node that we
want to execute in parallel. At this stage, a PARA node wraps just a SeqScan
node, which is:
typedef struct Para
{
    /* TODO: add a union to put all nodes supporting parallelism here */
    SeqScan     scan;
    /* Split / Merge / Redistribute */
    ParaType    type;
    /* TODO: other possible parameters */
} Para;
At execution time, the master (the process that receives the query) will wake
up a slave process (an idle ordinary backend), and the slave will pass the
scan results to the master via a shared-memory communication buffer. In
detail, the execution works like this:
Master process:
1. PARA init: wake up a slave, pass the queryTree and outerPlan(planTree) to
it by nodeToString();
2. PARA exec:
   get an item from the communication buffer;
   if the item is a valid tuple
       return the item;
   else
       handle other types of item;   /* execution done / error */
3. PARA end: do some cleanup.
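In rough C, the master's exec step might look like the sketch below. Everything here - the CommItem tags, the CommBuffer layout, and the function names - is invented for illustration, not actual PostgreSQL code; a real implementation would place the buffer in shared memory and block on an empty buffer instead of returning.

```c
#include <stddef.h>

typedef enum { ITEM_TUPLE, ITEM_DONE, ITEM_ERROR } ItemTag;

typedef struct CommItem
{
    ItemTag     tag;
    const void *tuple;              /* payload when tag == ITEM_TUPLE */
} CommItem;

#define COMM_SLOTS 8

/* The communication buffer the slave fills and the master drains;
 * in the real design this would live in shared memory. */
typedef struct CommBuffer
{
    CommItem    slots[COMM_SLOTS];
    int         head, tail;         /* consumer / producer cursors */
} CommBuffer;

static void
comm_put(CommBuffer *buf, CommItem item)
{
    buf->slots[buf->tail % COMM_SLOTS] = item;
    buf->tail++;
}

static int
comm_get(CommBuffer *buf, CommItem *out)
{
    if (buf->head == buf->tail)
        return 0;                   /* empty: a real master would wait */
    *out = buf->slots[buf->head % COMM_SLOTS];
    buf->head++;
    return 1;
}

/* Step 2 of the master loop: return the next tuple from the buffer,
 * or NULL once the slave reports completion (or an error). */
static const void *
para_exec(CommBuffer *buf)
{
    CommItem    item;

    if (!comm_get(buf, &item))
        return NULL;                /* would block and retry in reality */
    if (item.tag == ITEM_TUPLE)
        return item.tuple;
    return NULL;                    /* ITEM_DONE or ITEM_ERROR */
}
```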
As we can see from the PARA init stage, even with the simplest PARA node it
is easy to support inter-node parallelism.
Slave process (using code similar to the autovacuum process):
1. Get the queryTree and planTree;
2. Redirect the destReceiver to the communication buffer;
3. Encapsulate them in an executor and run.
The query plan is like this:
TEST=# explain select max(a), max(b) from t;
                               QUERY PLAN
----------------------------------------------------------------------
 Aggregate  (cost=7269.01..7269.02 rows=1 width=53)
   ->  Para [Split = 1]  (cost=10.00..5879.00 rows=278000 width=53)
         ->  Seq Scan on t  (cost=0.00..5879.00 rows=278000 width=53)
(3 rows)
There are some problems I haven't addressed yet. The most difficult one for
me is xid assignment: the master and slaves should see an identical view,
and the key to that is the xid. I am not sure of the correct solution to this
problem. We may use the same xid, or a contiguous range of xids, for the
master and slaves. Other problems, like login (the master and slaves should
act as the same user) and elog message passing, are also important, but I
think we can handle them without much trouble.
I haven't touched the most difficult part, the parallel query optimizer. But
thanks to the two-phase parallel optimization technique, this part can be
treated like the GEQO optimizer: without enough evidence of a benefit, we
simply don't enable parallel query execution.
Are there any show-stopping reasons not to do this?
Regards,
Qingqing
On Thu, Apr 06, 2006 at 06:28:33PM +0800, Qingqing Zhou wrote:
I have written some experimental code of doing master-slave seqscan in
PostgreSQL. During the work, I feel we had enough infrastructure to support
parallel query execution.
Good work. One question though: temporary tables. Currently there is no
way to get a consistent view of a temporary table from another process.
You probably have to avoid shipping off non-immutable function calls.
Is it possible to deadlock? (I don't see how, but I may not be looking
hard enough.)
The hardest work would be finding out what can safely be farmed off.
Nice work though, I hadn't thought of this approach.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.
On 4/6/06, Qingqing Zhou <zhouqq@cs.toronto.edu> wrote:
I have written some experimental code of doing master-slave seqscan in
PostgreSQL. During the work, I feel we had enough infrastructure to support
parallel query execution.
Great work! I had looked into this a little bit and came to the same
ideas/problems you did, but none of them seemed insurmountable at all.
I'd be interested in working with you on this if you'd like.
I'm interested in hearing Tom's suggestions on this topic too.
--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324
Jonah H. Harris wrote:
On 4/6/06, Qingqing Zhou <zhouqq@cs.toronto.edu> wrote:
I have written some experimental code of doing master-slave seqscan in
PostgreSQL. During the work, I feel we had enough infrastructure to support
parallel query execution.
Great work! I had looked into this a little bit and came to the same
ideas/problems you did, but none of them seemed insurmountable at all.
I'd be interested in working with you on this if you'd like.
I'm interested in hearing Tom's suggestions on this topic too.
From time to time there is (often ill-informed) discussion about making
postgres multi-threaded - this strikes me as a possible candidate area
for multithreading, and maybe that would be a way to overcome some of
the problems mentioned.
Just a thought
cheers
andrew
Hear hear ;-)
Regards,
Thomas Hallgren
Andrew Dunstan wrote:
Jonah H. Harris wrote:
On 4/6/06, Qingqing Zhou <zhouqq@cs.toronto.edu> wrote:
I have written some experimental code of doing master-slave seqscan in
PostgreSQL. During the work, I feel we had enough infrastructure to
support
parallel query execution.
Great work! I had looked into this a little bit and came to the same
ideas/problems you did, but none of them seemed insurmountable at all.
I'd be interested in working with you on this if you'd like.
I'm interested in hearing Tom's suggestions on this topic too.
From time to time there is (often ill-informed) discussion about making
postgres multi-threaded - this strikes me as a possible candidate area
for multithreading, and maybe that would be a way to overcome some of
the problems mentioned.
Just a thought
cheers
andrew
"Jonah H. Harris" <jonah.harris@gmail.com> wrote:
Great work! I had looked into this a little bit and came to the same
ideas/problems you did, but none of them seemed insurmountable at all.
I'd be interested in working with you on this if you'd like.
Yes, I am happy to work with anyone on this topic. The plan I have in mind
is like this:
(1) Stabilize the master-slave seqscan: solve all the remaining problems;
(2) Parallelize the seqscan: AFAICS, this should not be very difficult based
on (1); it may only need some scan-portion assignment;
(3) Add an indexscan or one or two other node types to the master-slave
solution: this is in order to make the framework extensible;
(4) Parallelize these nodes - this will be a big chunk of work;
(5) Add two-phase optimization to the server - we have to consider
partitioned tables at this stage; yet another big chunk of work;
Regards,
Qingqing
On 4/6/06, Qingqing Zhou <zhouqq@cs.toronto.edu> wrote:
"Jonah H. Harris" <jonah.harris@gmail.com> wrote:
Great work! I had looked into this a little bit and came to the same
ideas/problems you did, but none of them seemed insurmountable at all.
I'd be interested in working with you on this if you'd like.
First, I want to second Jonah's enthusiasm. This is very exciting!
Yes, I am happy to work with anyone on this topic. The plan I have in mind
is like this:
(1) Stabilize the master-slave seqscan: solve all the remaining problems;
(2) Parallelize the seqscan: AFAICS, this should not be very difficult based
on (1); it may only need some scan-portion assignment;
This is really only a gut feeling for me (it can't be otherwise, since
we can't yet test), but I think parallelizing a single seqscan is
pretty much guaranteed to do nothing, because seqscans, especially on
large tables, are IO bound.
There was a plan some time ago (during the 8.0 beta, I think) to allow
multiple seqscans from different queries to join each other, such that
scans that begin later start scanning the table at the point, or just
behind the point, that the first running scan has already reached. That
plan would reduce IO contention, and buffer and OS cache thrashing, by
having multiple readers pull from the same hose.
I can't see how asking for more than one stream from the same file
would do anything but increase both cache thrashing and IO bandwidth
contention. Am I missing something here?
(3) Add an indexscan or one or two other node types to the master-slave
solution: this is in order to make the framework extensible;
(4) Parallelize these nodes - this will be a big chunk of work;
Now that could be a _big_ win! Especially if tablespaces are used to
balance commonly combined tables and indexes.
(5) Add two-phase optimization to the server - we have to consider
partitioned tables at this stage; yet another big chunk of work;
Same here. This would be a place where parallel seqscans of different
tables (instead of multi-headed scan of one table) could buy you a
lot, especially with proper tablespace use.
Thanks again, Qingqing, for the work on this. I'm very excited about
where this could go. :)
Regards,
Qingqing
--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org
Qingqing,
First, I want to second Jonah's enthusiasm. This is very exciting!
Me, three! I didn't think this was ever going to come to Postgres absent
major corporate funding.
This is really only a gut feeling for me (it can't be otherwise, since
we can't yet test), but I think parallelizing a single seqscan is
pretty much guaranteed to do nothing, because seqscans, especially on
large tables, are IO bound.
Actually, not true. Our current seqscan performance suffers from
producer-consumer fluctuation. GreenPlum and Sun did a whole bunch of
testing on this.
Basically reading a large table off disk does this:
read some table while not processing
process in cpu while not reading
read some more table while not processing
process some more in cpu while not reading
etc.
resulting in an I/O throughput graph that looks like:
* * *
* * * * * *
* * * * * *
* * * *
The really annoying part about this, for me personally, is that the peaks
are significantly faster than comparable commercial DBMSes ... but our
average is far less. So even on a single seq scan, parallel query
execution would make a significant difference in performance, possibly as
much as +75% on seq scans of large tables.
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
On 4/7/06, Josh Berkus <josh@agliodbs.com> wrote:
So even on a single seq scan, parallel query
execution would make a significant difference
in performance, possibly as
much as +75% on seq scans of large tables.
I've been looking at several commercial systems which employ dynamic
partitioning for sequential scans of large tables. While 75% sounds a
little optimistic, there is most definitely a sizable performance
increase.
--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324
Josh Berkus <josh@agliodbs.com> writes:
Basically reading a large table off disk does this:
read some table while not processing
process in cpu while not reading
read some more table while not processing
process some more in cpu while not reading
etc.
resulting in an I/O throughput graph that looks like:
* * *
* * * * * *
* * * * * *
* * * *
Interesting ...
The really annoying part about this, for me personally, is that the peaks
are significantly faster than comparable commercial DBMSes ... but our
average is far less. So even on a single seq scan, parallel query
execution would make a significant difference in performance, possibly as
much as +75% on seq scans of large tables.
... but I'm failing to follow where it says that parallel processing
will fix that. All I can foresee in that direction is extra data
transfer costs, bought at the price of portability and locking headaches.
regards, tom lane
On 4/8/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
... but I'm failing to follow where it says that parallel processing
will fix that. All I can foresee in that direction is extra data
transfer costs, bought at the price of portability and locking headaches.
I don't think it's any less portable than the system is now; it's just
enabling multiple slave processes to participate in scans and
processing (parallel query, parallel index builds, parallel sorts,
...). Likewise, the additional I/O cost isn't that much of an issue
because systems which really take advantage of this type of parallel
processing have large-bandwidth I/O arrays anyway.
I didn't even want to mention that EVERY other database I know of
(Oracle, DB2, Sybase, SQL Server, Ingres, Bizgres MPP, MaxDB) supports
this, but it's a pretty obvious win for many environments.
--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324
Hi,
On Sat, 2006-04-08 at 13:16 -0400, Jonah H. Harris wrote:
On 4/8/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
... but I'm failing to follow where it says that parallel processing
will fix that. All I can foresee in that direction is extra data
transfer costs, bought at the price of portability and locking headaches.
I don't think it's any less portable than the system is now;
ACK. As long as processes, signals and shared memory are used this could
be as portable as PostgreSQL is now.
It's just
enabling multiple slave processes to participate in scans and
processing (parallel query, parallel index builds, parallel sorts,
...) Likewise, the additional I/O cost isn't that much of an issue
because systems which really take advantage of this type of parallel
processing have large bandwidth I/O arrays anyway.
Ehm.. which additional I/O cost? Or do you count inter-process
communication as I/O?
I'd like to help teach PostgreSQL the art of parallel query
execution. I have roughly implemented an internal message passing
system on a shared memory basis. This allows all pg-processes to send
messages to each other, addressed by PID. Upon receipt of a message, a
process gets a SIGUSR2 (as far as I can remember now).
On the positive side are:
- portable as I'm using only shmem and signals
- fast (not much copying around of the data in memory)
One downside I see with my approach is:
- the shared memory buffer is limited in size, so it can simply be
'full' and no more messages can be sent until another process reads
its messages and frees the memory for other messages.
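For what it's worth, the full-buffer behaviour I described can be sketched like this. The MsgArea layout and names are invented for illustration; the real patch tracks per-message boundaries and signals the recipient with SIGUSR2.

```c
#include <string.h>

#define MSG_AREA_SIZE 64        /* deliberately tiny for illustration */

typedef struct MsgArea
{
    char    data[MSG_AREA_SIZE];    /* would live in shared memory */
    int     used;
} MsgArea;

/* Returns 1 on success, 0 when there is not enough free space; the
 * sender must then wait until a reader frees the memory. */
static int
imsg_send(MsgArea *area, const char *msg, int len)
{
    if (area->used + len > MSG_AREA_SIZE)
        return 0;
    memcpy(area->data + area->used, msg, len);
    area->used += len;
    return 1;
}

/* Reading frees the whole area again; returns the number of bytes
 * drained.  Real code would track individual message boundaries. */
static int
imsg_drain(MsgArea *area)
{
    int     n = area->used;

    area->used = 0;
    return n;
}
```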
In case you're interested I'll compile a patch for review.
Regards
Markus
On 4/8/06, Markus Schiltknecht <markus@bluegap.ch> wrote:
ACK. As long as processes, signals and shared memory are used this could
be as portable as PostgreSQL is now.
This is certainly the case.
Ehm.. which additional I/O cost? Or do you count inter-process
communication as I/O?
Inter-process will add a minimal amount, but if it's done correctly,
you'll still end up ahead.
I'd like to help teach PostgreSQL the art of parallel query
execution.
Cool, we're not at the communication-level yet, but your help would be
appreciated.
In case you're interested I'll compile a patch for review.
Surely!
--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324
Hello,
On Sat, 2006-04-08 at 13:45 -0400, Jonah H. Harris wrote:
I'd like to help teach PostgreSQL the art of parallel query execution.
Cool, we're not at the communication level yet, but your help would be
appreciated.
In case you're interested I'll compile a patch for review.
Surely!
[ This patch is not meant to be included immediately. But it might be
one step towards supporting parallel query execution, IMHO. ]
This is my internal message passing implementation. I'd probably also
need to document its use...
It's not tested except for 'it works for me'. And it still misses the
'wrap around' functionality to recycle the buffer. Also, I should
probably rename buffer.h and buffer.c since there are a lot of other
buffers in PostgreSQL.
Markus
Attachments:
imsg.patch (text/x-patch; charset=utf-8) +670 -2
"Jonah H. Harris" <jonah.harris@gmail.com> writes:
On 4/8/06, Markus Schiltknecht <markus@bluegap.ch> wrote:
Ehm.. which additional I/O cost? Or do you count inter-process
communication as I/O?
Inter-process will add a minimal amount, but if it's done correctly,
you'll still end up ahead.
This is exactly the bit of optimism I was questioning. We've already
been sweating blood trying to reduce multiprocessor contention on data
structures in which collisions ought to be avoidable (ie, buffer arrays
where you hope not everyone is hitting the same buffer at once). I
think passing large volumes of data between different processes is going
to incur quite a lot of locking overhead, pipeline stalls for cache line
transfers, etc, etc, because heavy contention for the transfer buffer is
simply not going to be avoidable.
regards, tom lane
On 4/8/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
passing large volumes of data between different processes is going
to incur quite a lot of locking overhead, pipeline stalls for cache line
transfers, etc, etc, because heavy contention for the transfer buffer is
simply not going to be avoidable.
I don't think anyone believes there isn't going to be contention
somewhere. Anyone who has developed a parallel system acknowledges
there's a price to pay for parallelism (in complexity), which requires
good algorithms, strategy, and design.
I may certainly be misunderstanding you, but it seems like you want to
avoid parallelism altogether. Aside from PostgreSQL, the only systems
I know of which don't support parallelism in query execution are MySQL
and Firebird, but I don't see competing at the low end as a compelling
argument. Similarly, I don't think every major vendor would have put
money into developing something that isn't useful.
I do agree that there are many times you won't want to have parallel
query. The reasons for employing parallelism are very dependent on the
environment and application it's used in, so it's certainly going to
be up to the user to decide whether to use it or not.
We're currently playing with a proof-of-concept to find issues before
proposing a design. This doesn't affect the community in any way.
Rather than negativity, I'd really like to see your suggestions on
avoiding contention as much as possible; your ideas may certainly get
us past several obstacles more quickly.
--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324
On 4/8/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
This is exactly the bit of optimism I was questioning. We've already
been sweating blood trying to reduce multiprocessor contention on data
structures in which collisions ought to be avoidable (ie, buffer arrays
where you hope not everyone is hitting the same buffer at once). I
think passing large volumes of data between different processes is going
to incur quite a lot of locking overhead, pipeline stalls for cache line
transfers, etc, etc, because heavy contention for the transfer buffer is
simply not going to be avoidable.
We should consider true parallel execution and overlapping execution
with I/O as distinct cases.
For example, one case made in this thread involved bursty performance
with seqscans presumably because the I/O was stalling while processing
was being performed. In general this can be avoided without parallel
execution through the use of non-blocking I/O and making an effort to
keep the request pipeline full.
There are other cases where it is useful to perform parallel I/O
without parallel processing; for example, a query that will perform
an index lookup per row can benefit from running some number of those
lookups in parallel in order to hide the lookup latency and give the
OS and disk elevators a chance to make the random accesses a little
more orderly. This can be accomplished without true parallel
processing. (Perhaps PG does this already?)
Parallel execution to get access to more CPU and memory bandwidth is a
fine thing, and worth the costs in many cases... but it shouldn't be
used as an easy way to get parallel IO without careful consideration.
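As a hedged sketch of "keeping the request pipeline full" using only standard POSIX calls - the 8 kB block size, the prefetch distance, and all names here are my assumptions, not PostgreSQL code - one could hint the kernel a fixed distance ahead of the block currently being processed:

```c
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ_GUESS 8192           /* assumed 8 kB heap block size */
#define PREFETCH_DISTANCE 16        /* how far ahead to hint; a guess */

/* Demonstration callback: just counts the blocks handed to it. */
static unsigned blocks_seen = 0;

static void
count_block(const char *page, unsigned blk)
{
    (void) page;
    (void) blk;
    blocks_seen++;
}

/* Scan nblocks sequentially, hinting the kernel about blocks we will
 * need soon so reads proceed while we process earlier blocks. */
static void
scan_with_prefetch(int fd, unsigned nblocks,
                   void (*process) (const char *, unsigned))
{
    char        page[BLCKSZ_GUESS];
    unsigned    blk;

    for (blk = 0; blk < nblocks; blk++)
    {
        if (blk + PREFETCH_DISTANCE < nblocks)
            posix_fadvise(fd,
                          (off_t) (blk + PREFETCH_DISTANCE) * BLCKSZ_GUESS,
                          BLCKSZ_GUESS, POSIX_FADV_WILLNEED);
        if (pread(fd, page, BLCKSZ_GUESS, (off_t) blk * BLCKSZ_GUESS)
            == BLCKSZ_GUESS)
            process(page, blk);
    }
}
```

As noted below, fadvise-style calls are only hints; a programmable readahead layer would give more control, at the cost of trickier tuning.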
Greg,
On 4/8/06 5:43 PM, "Gregory Maxwell" <gmaxwell@gmail.com> wrote:
For example, one case made in this thread involved bursty performance
with seqscans presumably because the I/O was stalling while processing
was being performed. In general this can be avoided without parallel
execution through the use of non-blocking I/O and making an effort to
keep the request pipeline full.
I agree - there is a real continuing need for overlapped scan and execution,
though I'm not sure quite how to get it efficiently using inter-process
communication. The AIO interface is nicely proven at this point, but to
Tom's point, it's non-portable.
What we see now is that the executor is CPU bound at about 300MB/s scan rate
using modern Opteron CPUs. Using some kind of overlap strategy would help
this a lot - using programmable readahead to stage I/O requests
asynchronously from the scan node would be better than OS hints like
fadvise(), but the tuning would be tricky IMO.
- Luke
Gregory Maxwell wrote:
We should consider true parallel execution and overlapping execution
with I/O as distinct cases.
For example, one case made in this thread involved bursty performance
with seqscans presumably because the I/O was stalling while processing
was being performed. In general this can be avoided without parallel
execution through the use of non-blocking I/O and making an effort to
keep the request pipeline full.
There are other cases where it is useful to perform parallel I/O
without parallel processing.. for example: a query that will perform
an index lookup per row can benefit from running some number of those
lookups in parallel in order to hide the lookup latency and give the
OS and disk elevators a chance to make the random accesses a little
more orderly. This can be accomplished without true parallel
processing. (Perhaps PG does this already?)
I have done some testing more along these lines with an old fork of
postgres code (2001). In my tests, I used a thread to delegate out the
actual heap scan of the SeqScan. The job of the "slave" thread was to
fault in buffer pages and determine the time validity of the tuples.
ItemPointers are passed back to the "master" thread via a common memory
area guarded by mutex locking. The master thread is then responsible
for converting the ItemPointers to HeapTuples and finishing the
execution run. I added a little hack to the buffer code to force pages
read into the buffer to stay at the back of the free buffer list until
the master thread has had a chance to use them. These are the
parameters of my test table:
Pages 9459; Tup 961187: Live 673029, Dead 288158
Average tuple size is 70 bytes
create table test (rand int, message varchar(256))
So far I've done a couple of runs with a single query on a 2-processor
machine, with the following results via dtrace:
select * from test;
CPU ID FUNCTION:NAME
1 46218 ExecEndSeqScan:return Inline scan time 81729
0 46216 ExecEndDelegatedSeqScan:return Delegated scan time 59903
0 46218 ExecEndSeqScan:return Inline scan time 95708
0 46216 ExecEndDelegatedSeqScan:return Delegated scan time 58255
0 46218 ExecEndSeqScan:return Inline scan time 79028
0 46216 ExecEndDelegatedSeqScan:return Delegated scan time 50500
An average 34% decrease in total time using the delegated scan.
A very crude, simple test, but I think it shows some promise.
I know I used threads, but you could probably just as easily use a
slave process and pass ItemPointers via pipes or shared memory.
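In sketch form, that handoff looks like the following. All names here are invented stand-ins (the "ItemPointers" are plain integers); real code would add flow control so the producer cannot overrun the queue, and would sleep instead of busy-waiting.

```c
#include <pthread.h>

#define QUEUE_CAP 128

typedef struct ItemQueue
{
    unsigned long   items[QUEUE_CAP];   /* stand-ins for ItemPointers */
    int             head, tail;
    pthread_mutex_t lock;
    int             done;               /* slave finished the scan */
} ItemQueue;

/* Slave: fake a heap scan of 100 tuples, pushing each "ItemPointer"
 * into the mutex-guarded queue.  (100 < QUEUE_CAP, so this toy needs
 * no flow control; real code would wait when the queue is full.) */
static void *
slave_scan(void *arg)
{
    ItemQueue  *q = arg;
    unsigned long tid;

    for (tid = 0; tid < 100; tid++)
    {
        pthread_mutex_lock(&q->lock);
        q->items[q->tail++ % QUEUE_CAP] = tid;
        pthread_mutex_unlock(&q->lock);
    }
    pthread_mutex_lock(&q->lock);
    q->done = 1;
    pthread_mutex_unlock(&q->lock);
    return NULL;
}

/* Master: pop items (where it would fetch HeapTuples) until the slave
 * is done and the queue is drained; returns the number consumed. */
static long
master_consume(ItemQueue *q)
{
    long        consumed = 0;

    for (;;)
    {
        int         have, finished;

        pthread_mutex_lock(&q->lock);
        have = q->head != q->tail;
        finished = q->done && !have;
        if (have)
        {
            q->head++;                  /* "convert" the ItemPointer */
            consumed++;
        }
        pthread_mutex_unlock(&q->lock);
        if (finished)
            return consumed;
    }
}
```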
Thanks,
Myron Scott
Myron,
First, this sounds really good!
On 4/8/06 9:54 PM, "Myron Scott" <lister@sacadia.com> wrote:
I added a little hack to the buffer code to force pages read into the
buffer to stay at the back of the free buffer list until the master
thread has had a chance to use them.
This is the part I'm curious about - is this using the shared_buffers region
in a circular buffer fashion to store pre-fetched pages?
One thing I've wondered about is: how much memory is required to get
efficient overlap? Did you find that you had to tune the amount of buffer
memory to get the performance to work out?
- Luke