parallel mode and parallel contexts

Started by Robert Haas over 11 years ago · 80 messages · pgsql-hackers
#1 Robert Haas
robertmhaas@gmail.com

Attached is a patch that adds two new concepts: parallel mode, and
parallel contexts. The idea of this code is to provide a framework
for specific parallel things that you want to do, such as parallel
sequential scan or parallel sort. When you're in parallel mode,
certain operations - like DDL, and anything that would update the
command counter - are prohibited. But you gain the ability to create
a parallel context, which in turn can be used to fire up parallel
workers. And if you do that, then your snapshot, combo CID hash, and
GUC values will be copied to the worker, which is handy.
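A toy sketch of that lifecycle may help fix ideas: enter parallel mode, hand each worker a copy of the synchronized state, and forbid changes to that state while parallelism is active. This is plain Python with invented names, not the patch's C code.

```python
# Toy model of the parallel-mode lifecycle described above.
# All names here are invented for illustration; the real mechanism
# lives in C and uses dynamic shared memory.

import copy

class ToyBackend:
    """Simulates a backend whose shared state (snapshot, combo CIDs,
    GUC values) is copied to workers and frozen while parallelism runs."""

    def __init__(self, state):
        self.state = dict(state)
        self.in_parallel_mode = False

    def enter_parallel_mode(self):
        self.in_parallel_mode = True

    def exit_parallel_mode(self):
        self.in_parallel_mode = False

    def set_state(self, key, value):
        # In parallel mode, changing already-synchronized state is
        # prohibited, because the workers hold a stale copy of it.
        if self.in_parallel_mode:
            raise RuntimeError("cannot modify state in parallel mode")
        self.state[key] = value

    def launch_worker(self):
        # Each worker starts from a snapshot of the master's state.
        assert self.in_parallel_mode
        return copy.deepcopy(self.state)

master = ToyBackend({"snapshot": "xmin=100", "work_mem": "4MB"})
master.enter_parallel_mode()
worker_state = master.launch_worker()
```

The point the model captures is only the invariant: workers see a copy, so the master must not let the original drift while they run.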

This patch is very much half-baked. Among the things that aren't right yet:

- There's no handling of heavyweight locking, so I'm quite sure it'll
be possible to cause undetected deadlocks if you work at it. There
are some existing threads on this topic and perhaps we can incorporate
one of those concepts into this patch, but this version does not.
- There's no provision for copying the parent's XID and sub-XIDs, if
any, to the background workers, which means that if you use this and
your transaction has written data, you will get wrong answers, because
TransactionIdIsCurrentTransactionId() will do the wrong thing.
- There's no really deep integration with the transaction system yet.
Previous discussions seem to point toward the need to do various types
of coordinated cleanup when the parallel phase is done, or when an
error happens. In particular, you probably don't want the abort
record to get written while there are still possibly backends that are
part of that transaction doing work; and you certainly don't want
files created by the current transaction to get removed while some
other backend is still writing them. The right way to work all of
this out needs some deep thought; agreeing on what the design should
be is probably harder than implementing it.
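The TransactionIdIsCurrentTransactionId() hazard in the second item can be pictured abstractly: that check answers "did my own transaction write this tuple?", and a worker that does not inherit the master's XID and sub-XIDs will answer "no" for the master's own writes. A hedged sketch, with illustrative XID values:

```python
# Abstract sketch of the visibility problem described above (not
# PostgreSQL code): a worker must treat the initiating backend's XID
# and sub-XIDs as its own, or tuples the master wrote look like some
# other transaction's and visibility goes wrong.

def xid_is_current(xid, own_xid, inherited_xids=frozenset()):
    """Return True if xid belongs to 'our' transaction, counting any
    XIDs inherited from the initiating backend."""
    return xid == own_xid or xid in inherited_xids

MASTER_XID = 100            # master's top-level XID (illustrative)
MASTER_SUBXIDS = {101, 102}  # its sub-transaction XIDs (illustrative)

# A worker with no inherited XIDs wrongly sees the master's writes as
# foreign; with them, its answer agrees with the master's.
naive = xid_is_current(101, own_xid=None)
fixed = xid_is_current(101, own_xid=None,
                       inherited_xids={MASTER_XID} | MASTER_SUBXIDS)
```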

Despite the above, I think this does a fairly good job laying out how
I believe parallelism can be made to work in PostgreSQL: copy a bunch
of state from the user backend to the parallel workers, compute for a
while, and then shut everything down. Meanwhile, while parallelism is
running, forbid changes to state that's already been synchronized, so
that things don't get out of step. I think the patch shows how the
act of synchronizing state from the master to the workers can be made
quite modular and painless, even though it doesn't synchronize
everything relevant. I'd really appreciate any design thoughts anyone
may have on how to fix the problems mentioned above, how to fix any
other problems you foresee, or even just a list of reasons why you
think this will blow up.

What I think is that we're really pretty close to doing real parallelism,
and that this is probably the last major piece of infrastructure that
we need in order to support parallel execution in a reasonable way.
That's a pretty bold statement, but I believe it to be true: despite
the limitations of the current version of this patch, I think we're
very close to being able to sit down and code up a parallel algorithm
in PostgreSQL and have that not be all that hard. Once we get the
first one, I expect a whole bunch more to come together far more
quickly than the first one did.

I would be remiss if I failed to mention that this patch includes work
by my colleagues Amit Kapila, Rushabh Lathia, and Jeevan Chalke, as
well as my former colleague Noah Misch; and that it would not have
been possible without the patient support of EnterpriseDB management.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

parallel-mode-v0.patch (text/x-patch; charset=US-ASCII), +1727/-42
#2 Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#1)
Re: parallel mode and parallel contexts

On 12 December 2014 at 22:52, Robert Haas <robertmhaas@gmail.com> wrote:

I would be remiss if I failed to mention that this patch includes work
by my colleagues Amit Kapila, Rushabh Lathia, and Jeevan Chalke, as
well as my former colleague Noah Misch; and that it would not have
been possible without the patient support of EnterpriseDB management.

Noted and thanks to all.

I will make this my priority for review, but regrettably not until New
Year, about 2.5-3 weeks away.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#2)
Re: parallel mode and parallel contexts

On Wed, Dec 17, 2014 at 9:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 12 December 2014 at 22:52, Robert Haas <robertmhaas@gmail.com> wrote:

I would be remiss if I failed to mention that this patch includes work
by my colleagues Amit Kapila, Rushabh Lathia, and Jeevan Chalke, as
well as my former colleague Noah Misch; and that it would not have
been possible without the patient support of EnterpriseDB management.

Noted and thanks to all.

I will make this my priority for review, but regrettably not until New
Year, about 2.5-3 weeks away.

OK!

In the meantime, I had a good chat with Heikki on IM yesterday which
gave me some new ideas on how to fix up the transaction handling in
here, which I am working on implementing. So hopefully I will have
that by then.

I am also hoping Amit will be adapting his parallel seq-scan patch to
this framework.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#4 Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#3)
Re: parallel mode and parallel contexts

On Wed, Dec 17, 2014 at 2:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:

In the meantime, I had a good chat with Heikki on IM yesterday which
gave me some new ideas on how to fix up the transaction handling in
here, which I am working on implementing. So hopefully I will have
that by then.

And here is a new version. This version has some real integration
with the transaction system, along the lines of what Heikki suggested
to me, so hopefully tuple visibility calculations in a parallel worker
will now produce the right answers, though I need to integrate this
with code to actually do something-or-other in parallel in order to
really test that. There are still some problems with parallel worker
shutdown. As hard as I tried to fight it, I'm gradually resigning
myself to the fact that we're probably going to have to set things up
so that the worker waits for all of its children to die during abort
processing (and during commit processing, but that's not bothersome).
Otherwise, to take just one example, chaos potentially ensues if you
run a parallel query in a subtransaction and then roll back to a
savepoint. But this version just kills the workers and doesn't
actually wait for them to die; I'll see about fixing that, but wanted
to send this around for comments first.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

parallel-mode-v0.1.patch (text/x-patch; charset=US-ASCII), +2006/-56
#5 Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#3)
Re: parallel mode and parallel contexts

On Thu, Dec 18, 2014 at 4:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Dec 17, 2014 at 9:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 12 December 2014 at 22:52, Robert Haas <robertmhaas@gmail.com> wrote:

I would be remiss if I failed to mention that this patch includes work
by my colleagues Amit Kapila, Rushabh Lathia, and Jeevan Chalke, as
well as my former colleague Noah Misch; and that it would not have
been possible without the patient support of EnterpriseDB management.

Noted and thanks to all.

I will make this my priority for review, but regrettably not until New
Year, about 2.5-3 weeks away.

OK!

In the meantime, I had a good chat with Heikki on IM yesterday which
gave me some new ideas on how to fix up the transaction handling in
here, which I am working on implementing. So hopefully I will have
that by then.

I am also hoping Amit will be adapting his parallel seq-scan patch to
this framework.

Simon, if you are planning to review this patch soon, could you add
your name as a reviewer for this patch in the commit fest app?
Thanks,
--
Michael


#6 Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#4)
Re: parallel mode and parallel contexts

On Fri, Dec 19, 2014 at 8:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:

And here is a new version.

Here is another new version, with lots of bugs fixed. The worker
shutdown sequence is now much more robust, although I think there may
still be a bug or two lurking, and I fixed a bunch of other things
too. There's now a function called parallel_count() in the
parallel_dummy extension contained herein, which does a parallel count
of a relation you choose:

rhaas=# select count(*) from pgbench_accounts;
count
---------
4000000
(1 row)

Time: 396.635 ms
rhaas=# select parallel_count('pgbench_accounts'::regclass, 0);
NOTICE: PID 2429 counted 4000000 tuples
parallel_count
----------------
4000000
(1 row)

Time: 234.445 ms
rhaas=# select parallel_count('pgbench_accounts'::regclass, 4);
NOTICE: PID 2499 counted 583343 tuples
CONTEXT: parallel worker, pid 2499
NOTICE: PID 2500 counted 646478 tuples
CONTEXT: parallel worker, pid 2500
NOTICE: PID 2501 counted 599813 tuples
CONTEXT: parallel worker, pid 2501
NOTICE: PID 2502 counted 611389 tuples
CONTEXT: parallel worker, pid 2502
NOTICE: PID 2429 counted 1558977 tuples
parallel_count
----------------
4000000
(1 row)

Time: 150.004 ms
rhaas=# select parallel_count('pgbench_accounts'::regclass, 8);
NOTICE: PID 2429 counted 1267566 tuples
NOTICE: PID 2504 counted 346236 tuples
CONTEXT: parallel worker, pid 2504
NOTICE: PID 2505 counted 345077 tuples
CONTEXT: parallel worker, pid 2505
NOTICE: PID 2506 counted 355325 tuples
CONTEXT: parallel worker, pid 2506
NOTICE: PID 2507 counted 350872 tuples
CONTEXT: parallel worker, pid 2507
NOTICE: PID 2508 counted 338855 tuples
CONTEXT: parallel worker, pid 2508
NOTICE: PID 2509 counted 336903 tuples
CONTEXT: parallel worker, pid 2509
NOTICE: PID 2511 counted 326716 tuples
CONTEXT: parallel worker, pid 2511
NOTICE: PID 2510 counted 332450 tuples
CONTEXT: parallel worker, pid 2510
parallel_count
----------------
4000000
(1 row)

Time: 166.347 ms

This example table (pgbench_accounts, scale 40, ~537 MB) is small
enough that parallelism doesn't really make sense; you can see from
the notice messages above that the master manages to count a quarter
of the table before the workers get themselves up and running. The
point is rather to show how the infrastructure works and that it can
be used to write code to do practically useful tasks in a surprisingly
small number of lines of code; parallel_count is only maybe ~100 lines
on top of the base patch.
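The division of labor behind those numbers can be mimicked in a few lines: each participant, master included, repeatedly claims the next unscanned block and counts its tuples, so the total is exact no matter how many workers actually start. This is an illustrative Python sketch of the scheme, not the extension's C code.

```python
# Self-contained sketch of block-claiming parallel counting: every
# participant (the master plus N workers) atomically claims the next
# unscanned block and counts the tuples in it. The per-participant
# counts vary with timing, but the grand total is always exact.

import threading

def parallel_count(blocks, nworkers):
    next_block = [0]                  # next block number to claim
    lock = threading.Lock()
    counts = [0] * (nworkers + 1)     # one slot per participant

    def scan(slot):
        while True:
            with lock:                # atomically claim one block
                b = next_block[0]
                if b >= len(blocks):
                    return
                next_block[0] += 1
            counts[slot] += len(blocks[b])

    workers = [threading.Thread(target=scan, args=(i + 1,))
               for i in range(nworkers)]
    for w in workers:
        w.start()
    scan(0)                           # the "master" joins the scan too
    for w in workers:
        w.join()
    return sum(counts)

table = [["row"] * 50 for _ in range(100)]   # 100 blocks x 50 tuples
```

With nworkers = 0 the master simply scans everything itself, which mirrors the parallel_count(..., 0) call in the session above.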

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

parallel-mode-v0.2.patch (text/x-patch; charset=US-ASCII), +2274/-56
#7 Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#3)
Re: parallel mode and parallel contexts

On Thu, Dec 18, 2014 at 1:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:

In the meantime, I had a good chat with Heikki on IM yesterday which
gave me some new ideas on how to fix up the transaction handling in
here, which I am working on implementing. So hopefully I will have
that by then.

I am also hoping Amit will be adapting his parallel seq-scan patch to
this framework.

While working on the parallel seq-scan patch to adapt it to this framework, I
noticed a few things and have questions regarding them.

1.
Currently the parallel worker just attaches to the error queue; for the tuple
queue, do you expect that to be done in the same place or in the
caller-supplied function? If the latter, then we need the segment address as
input to that function to attach the queue to the segment (shm_mq_attach()).
Another question I have in this regard: if we have redirected messages
to the error queue by using pq_redirect_to_shm_mq, then how can we set up
a tuple queue for the same purpose? Similarly, I think more handling is
needed for the tuple queue in the master backend, and the answer to the
above will dictate the best way to do it.

2.
Currently there is no interface for wait_for_workers_to_become_ready()
in your patch. Don't you think it is important that, before we start fetching
tuples, we make sure all workers have started? What if some worker
fails to start?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#8 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#6)
Re: parallel mode and parallel contexts

On 2014-12-22 14:14:31 -0500, Robert Haas wrote:

On Fri, Dec 19, 2014 at 8:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:

And here is a new version.

Here is another new version, with lots of bugs fixed.

A couple remarks:
* Shouldn't this provide more infrastructure to deal with the fact that
we might get fewer parallel workers than we had planned for?
* I wonder if parallel contexts shouldn't be tracked via resowners
* combocid.c should error out in parallel mode, as insurance
* I doubt it's a good idea to allow heap_insert, heap_inplace_update,
index_insert. I'm not convinced that those will be handled correctly, and
relaxing restrictions is easier than adding them.
* I'd much rather separate HandleParallelMessageInterrupt() into one
variant that just tells the machinery there's an interrupt (called from
signal handlers) and one that actually handles them. We shouldn't even
consider adding any new code that does allocations, errors and such in
signal handlers. That's just a *bad* idea.
* I'm not a fan of the shm_toc stuff... Verbose and hard to read. I'd
much rather place a struct there and be careful about not using
pointers. That also would obviate the need to reserve some ids.
* Doesn't the restriction in GetSerializableTransactionSnapshotInt()
apply for repeatable read just the same?
* I'm not a fan of relying on GetComboCommandId() to restore things in
the same order as before.
* I'd say go ahead and commit the BgworkerByOid thing
earlier/independently. I've seen need for that a couple times.
* s/entrypints/entrypoints/

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#9 Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#7)
Re: parallel mode and parallel contexts

On Fri, Jan 2, 2015 at 9:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

While working on the parallel seq-scan patch to adapt it to this framework, I
noticed a few things and have questions regarding them.

1.
Currently the parallel worker just attaches to the error queue; for the tuple
queue, do you expect that to be done in the same place or in the
caller-supplied function? If the latter, then we need the segment address as
input to that function to attach the queue to the segment (shm_mq_attach()).
Another question I have in this regard: if we have redirected messages
to the error queue by using pq_redirect_to_shm_mq, then how can we set up
a tuple queue for the same purpose? Similarly, I think more handling is
needed for the tuple queue in the master backend, and the answer to the
above will dictate the best way to do it.

I've come to the conclusion that it's a bad idea for tuples to be sent
through the same queue as errors. We want errors (or notices, but
especially errors) to be processed promptly, but there may be a
considerable delay in processing tuples. For example, imagine a plan
that looks like this:

Nested Loop
  -> Parallel Seq Scan on p
  -> Index Scan on q
       Index Cond: q.x = p.x

The parallel workers should fill up the tuple queues used by the
parallel seq scan so that the master doesn't have to do any of that
work itself. Therefore, the normal situation will be that those tuple
queues are all full. If an error occurs in a worker at that point, it
can't add it to the tuple queue, because the tuple queue is full. But
even if it could do that, the master then won't notice the error until
it reads all of the queued-up tuple messages that are in the queue in
front of the error, and maybe some messages from the other queues as
well, since it probably round-robins between the queues or something
like that. Basically, it could do a lot of extra work before noticing
that error in there.

Now we could avoid that by having the master read messages from the
queue immediately and just save them off to local storage if they
aren't error messages. But that's not very desirable either, because
now we have no flow-control. The workers will just keep spamming
tuples that the master isn't ready for into the queues, and the master
will keep reading them and saving them to local storage, and
eventually it will run out of memory and die.

We could engineer some solution to this problem, of course, but it
seems quite a bit simpler to just have two queues. The error queues
don't need to be very big (I made them 16kB, which is trivial on any
system on which you care about having working parallelism) and the
tuple queues can be sized as needed to avoid pipeline stalls.
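The argument is easy to demonstrate with two bounded queues. On a single shared channel an error lands behind every unread tuple; on a dedicated error channel it is the first thing the master reads. This is an illustrative Python sketch, not shm_mq itself.

```python
# Sketch of the single-queue problem described above: with one bounded
# queue, an error message queues up behind unread tuples and the master
# must drain them all before noticing it. A separate, small error queue
# makes the error visible immediately.

from queue import Queue

# One shared queue: the worker emits tuples, then hits an error.
shared = Queue(maxsize=10)
for i in range(9):
    shared.put(("tuple", i))
shared.put(("error", "division by zero"))

drained_before_error = 0
while True:
    kind, payload = shared.get()
    if kind == "error":
        break
    drained_before_error += 1

# Two queues: errors travel on their own channel, so the error is the
# first message the master sees there, regardless of queued tuples.
tuples, errors = Queue(maxsize=10), Queue(maxsize=2)
for i in range(9):
    tuples.put(("tuple", i))
errors.put(("error", "division by zero"))
first_on_error_queue = errors.get()
```

The real design adds one further wrinkle the sketch omits: a full tuple queue blocks the worker (flow control) without ever blocking error delivery.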

2.
Currently there is no interface for wait_for_workers_to_become_ready()
in your patch, don't you think it is important that before start of fetching
tuples, we should make sure all workers are started, what if some worker
fails to start?

I think that, in general, getting the most benefit out of parallelism
means *avoiding* situations where backends have to wait for each
other. If the relation being scanned is not too large, the user
backend might be able to finish the whole scan - or a significant
fraction of it - before the workers initialize. Of course in that
case it might have been a bad idea to parallelize in the first place,
but we should still try to make the best of the situation. If some
worker fails to start, then instead of having the full degree N
parallelism we were hoping for, we have some degree K < N, so things
will take a little longer, but everything should still work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#10 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#8)
Re: parallel mode and parallel contexts

On Sat, Jan 3, 2015 at 7:31 PM, Andres Freund <andres@2ndquadrant.com> wrote:

A couple remarks:
* Shouldn't this provide more infrastructure to deal with the fact that
we might get fewer parallel workers than we had planned for?

Maybe, but I'm not really sure what that should look like. My working
theory is that individual parallel applications (executor nodes, or
functions that use parallelism internally, or whatever) should be
coded in such a way that they work correctly even if the number of
workers that starts is smaller than what they requested, or even zero.
It may turn out that this is impractical in some cases, but I don't
know what those cases are yet so I don't know what the common threads
will be.

I think parallel seq scan should work by establishing N tuple queues
(where N is the number of workers we have). When asked to return a
tuple, the master should poll all of those queues one after another
until it finds one that contains a tuple. If none of them contain a
tuple then it should claim a block to scan and return a tuple from
that block. That way, if there are enough running workers to keep up
with the master's demand for tuples, the master won't do any of the
actual scan itself. But if the workers can't keep up -- e.g. suppose
90% of the CPU consumed by the query is in the filter qual for the
scan -- then the master can pitch in along with everyone else. As a
non-trivial fringe benefit, if the workers don't all start, or take a
while to initialize, the user backend isn't stalled meanwhile.
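That polling scheme can be sketched in a few lines: the master checks each worker queue in turn, and only when all are empty does it claim a block and produce a tuple itself. Names and data shapes here are invented for illustration.

```python
# Sketch of the round-robin polling with master fallback described
# above. Worker queues are modeled as deques of ready tuples; the
# unscanned relation is a deque of blocks the master can claim itself.

from collections import deque

def next_tuple(worker_queues, unscanned_blocks):
    """Return (source, item): a tuple from the first non-empty worker
    queue, else one the master produces by scanning a block itself."""
    for i, q in enumerate(worker_queues):   # round-robin poll
        if q:
            return ("worker%d" % i, q.popleft())
    if unscanned_blocks:                    # all queues empty: pitch in
        return ("master", unscanned_blocks.popleft())
    return (None, None)                     # scan complete

queues = [deque(), deque(["t1"]), deque()]
blocks = deque(["b1", "b2"])
first = next_tuple(queues, blocks)    # a worker had a tuple ready
second = next_tuple(queues, blocks)   # queues empty, master scans
```

If workers keep the queues full, the master never reaches the fallback branch; if they are slow to start or absent, it degrades gracefully to scanning alone.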

* I wonder if parallel contexts shouldn't be tracked via resowners

That is a good question. I confess that I'm somewhat fuzzy about
which things should be tracked via the resowner mechanism vs. which
things should have their own bespoke bookkeeping. However, the
AtEOXact_Parallel() stuff happens well before ResourceOwnerRelease(),
which makes merging them seem not particularly clean.

* combocid.c should error out in parallel mode, as insurance

Eh, which function? HeapTupleHeaderGetCmin(),
HeapTupleHeaderGetCmax(), and AtEOXact_ComboCid() are intended to work
in parallel mode. HeapTupleHeaderAdjustCmax() could
Assert(!IsInParallelMode()) but the only calls are in heap_update()
and heap_delete() which already have error checks, so putting yet
another elog() there seems like overkill.

* I doubt it's a good idea to allow heap_insert, heap_inplace_update,
index_insert. I'm not convinced that those will be handled correctly, and
relaxing restrictions is easier than adding them.

I'm fine with adding checks to heap_insert() and
heap_inplace_update(). I'm not sure we really need to add anything to
index_insert(); how are we going to get there without hitting some
other prohibited operation first?

* I'd much rather separate HandleParallelMessageInterrupt() into one
variant that just tells the machinery there's an interrupt (called from
signal handlers) and one that actually handles them. We shouldn't even
consider adding any new code that does allocations, errors and such in
signal handlers. That's just a *bad* idea.

That's a nice idea, but I didn't invent the way this crap works today.
ImmediateInterruptOK gets set to true while performing a lock wait,
and we need to be able to respond to errors while in that state. I
think there's got to be a better way to handle that than force every
asynchronous operation to have to cope with the fact that
ImmediateInterruptOK may be set or not set, but as of today that's
what you have to do.

* I'm not a fan of the shm_toc stuff... Verbose and hard to read. I'd
much rather place a struct there and be careful about not using
pointers. That also would obviate the need to reserve some ids.

I don't see how that's going to work with variable-size data
structures, and a bunch of the things that we need to serialize are
variable-size. Moreover, I'm not really interested in rehashing this
design again. I know it's not your favorite; you've said that before.
But it makes it possible to write code to do useful things in
parallel, a capability that we do not otherwise have. And I like it
fine.
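The variable-size point is concrete: a table of contents maps small integer keys to (offset, length) pairs within one contiguous segment, so readers can find each serialized blob without the two sides agreeing on a fixed struct layout. A hedged sketch, loosely modeled on shm_toc's key/lookup idea with invented keys:

```python
# Sketch of a table-of-contents layout for variable-size state packed
# into one contiguous segment: each entry is keyed by an integer and
# located via (offset, length), so no fixed struct layout is needed.
# Keys and payloads below are illustrative, not PostgreSQL's.

def toc_create(entries):
    """Pack variable-length byte strings into one buffer, returning the
    buffer plus an index mapping key -> (offset, length)."""
    buf, index, offset = bytearray(), {}, 0
    for key, data in entries.items():
        buf += data
        index[key] = (offset, len(data))
        offset += len(data)
    return bytes(buf), index

def toc_lookup(buf, index, key):
    offset, length = index[key]
    return buf[offset:offset + length]

segment, toc = toc_create({
    1: b"serialized snapshot",   # entries differ in length,
    2: b"guc state",             # which a fixed struct cannot
    3: b"combo cid hash",        # accommodate directly
})
```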

* Doesn't the restriction in GetSerializableTransactionSnapshotInt()
apply for repeatable read just the same?

Yeah. I'm not sure whether we really need that check at all, because
there is a check in GetTransactionSnapshot() that probably checks the
same thing.

* I'm not a fan of relying on GetComboCommandId() to restore things in
the same order as before.

I thought that was a little wonky, but it's not likely to break, and
there is an elog(ERROR) there to catch it if it does, so I'd rather
not make it more complicated.

* I'd say go ahead and commit the BgworkerByOid thing
earlier/independently. I've seen need for that a couple times.

Woohoo! I was hoping someone would say that.

* s/entrypints/entrypoints/

Thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#10)
Re: parallel mode and parallel contexts

Robert Haas <robertmhaas@gmail.com> writes:

On Sat, Jan 3, 2015 at 7:31 PM, Andres Freund <andres@2ndquadrant.com> wrote:

* I wonder if parallel contexts shouldn't be tracked via resowners

That is a good question. I confess that I'm somewhat fuzzy about
which things should be tracked via the resowner mechanism vs. which
things should have their own bespoke bookkeeping. However, the
AtEOXact_Parallel() stuff happens well before ResourceOwnerRelease(),
which makes merging them seem not particularly clean.

FWIW, the resowner mechanism was never meant as a substitute for bespoke
bookkeeping. What it is is a helper mechanism to reduce the need for
PG_TRY blocks that guarantee that a resource-releasing function will be
called even in error paths. I'm not sure whether that analogy applies
well in parallel-execution cases.

regards, tom lane


#12 Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#6)
Re: parallel mode and parallel contexts

On 22 December 2014 at 19:14, Robert Haas <robertmhaas@gmail.com> wrote:

Here is another new version, with lots of bugs fixed.

An initial blind review, independent of other comments already made on thread.

OK src/backend/access/heap/heapam.c
heapam.c prohibitions on update and delete look fine

OK src/backend/access/transam/Makefile

OK src/backend/access/transam/README.parallel
README.parallel and all concepts look good

PARTIAL src/backend/access/transam/parallel.c
wait_patiently mentioned in comment but doesn’t exist
Why not make nworkers into a uint?
Trampoline? Really? I think we should define what we mean by that,
somewhere, rather than just use the term as if it was self-evident.
Entrypints?
..

OK src/backend/access/transam/varsup.c

QUESTIONS src/backend/access/transam/xact.c
These comments don’t have any explanation or justification

+ * This stores XID of the toplevel transaction to which we belong.
+ * Normally, it's the same as TopTransactionState.transactionId, but in
+ * a parallel worker it may not be.
+ * If we're a parallel worker, there may be additional XIDs that we need
+ * to regard as part of our transaction even though they don't appear
+ * in the transaction state stack.
This comment is copied-and-pasted too many times for safety and elegance
+       /*
+        * Workers synchronize transaction state at the beginning of each parallel
+        * operation, so we can't account for transaction state change after that
+        * point.  (Note that this check will certainly error out if s->blockState
+        * is TBLOCK_PARALLEL_INPROGRESS, so we can treat that as an invalid case
+        * below.)
+        */
We need a single Check() that contains more detail and comments within

Major comments

* Parallel stuff at start sounded OK
* Transaction stuff strikes me as overcomplicated and error-prone.
Given there is no explanation or justification for most of it, I’d be
inclined to question why it's there
* If we have a trampoline for DSM, why don’t we use a trampoline for
snapshots, then you wouldn’t need to do all this serialize stuff

This is very impressive, but it looks like we’re trying to lift too
much weight on the first lift. If we want to go anywhere near this, we
need to have very clear documentation about how and why it's like that.
I’m actually very sorry to say that because the review started very
well, much much better than most.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#13 Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#10)
Re: parallel mode and parallel contexts

On Mon, Jan 5, 2015 at 9:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Jan 3, 2015 at 7:31 PM, Andres Freund <andres@2ndquadrant.com>

wrote:

* Doesn't the restriction in GetSerializableTransactionSnapshotInt()
apply for repeatable read just the same?

Yeah. I'm not sure whether we really need that check at all, because
there is a check in GetTransactionSnapshot() that probably checks the
same thing.

The check in GetSerializableTransactionSnapshotInt() will also prohibit
any user/caller of SetSerializableTransactionSnapshot() in parallel mode
as that can't be prohibited by GetTransactionSnapshot().

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#14 Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#12)
Re: parallel mode and parallel contexts

On Mon, Jan 5, 2015 at 6:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

PARTIAL src/backend/access/transam/parallel.c
wait_patiently mentioned in comment but doesn’t exist

Woops. Fixed. Originally, DestroyParallelContext() had a Boolean
argument wait_patiently, specifying whether it should exit at once or
do the equivalent of WaitForParallelWorkersToFinish() first. But
that turned out not to work well - for example, in the included
parallel_count example, it's very important to (1) first wait for all
the workers to exit, (2) then read the final count, and (3) only after
that destroy the dynamic shared memory segment. So
WaitForParallelWorkersToFinish() got separated out into a different
function, but I failed to update the comments.

Why not make nworkers into a uint?

We don't have or use a type called uint. I could make it uint32 or
uint64 or uint16, but I don't see much advantage in that. I could
also make it unsigned, but we make very little use of the unsigned
type generally, so I picked int as most consistent with our general
practice. If there's a consensus that something else is much better
than int, I will change it throughout, but I think we have bigger fish
to fry.

Trampoline? Really? I think we should define what we mean by that,
somewhere, rather than just use the term as if it was self-evident.

Comments improved.

Entrypints?

Already noted by Andres; fixed in the attached version.

These comments don’t have any explanation or justification

OK, I rewrote them. Hopefully it's better now.

This comment is copied-and-pasted too many times for safety and elegance

It's not the same comment each time it appears; it appears in the
exact form you quoted just twice. It does get a little long-winded, I
guess, but I think that parallelism is a sufficiently-unfamiliar
concept to us as a group that it makes sense to have a detailed
comment in each place explaining exactly why that particular callsite
requires a check, so that's what I've tried to do.

* Parallel stuff at start sounded OK
* Transaction stuff strikes me as overcomplicated and error-prone.
Given there is no explanation or justification for most of it, I’d be
inclined to question why it's there

Gosh, I was pretty pleased with how simple the transaction integration
turned out to be. Most of what's there right now is either (a) code
to copy state from the master to the parallel workers or (b) code to
throw errors if the workers try to do things that aren't safe. I suspect
there are a few things missing, but I don't see anything there that
looks unnecessary.

* If we have a trampoline for DSM, why don’t we use a trampoline for
snapshots, then you wouldn’t need to do all this serialize stuff

The trampoline is just to let extensions use this infrastructure if
they want to; there's no way to avoid the serialization-and-restore
stuff unless we switch to creating the child processes using fork(),
but that would:

1. Not work on Windows.

2. Require the postmaster to deal with processes that are not its
immediate children.

3. Possibly introduce other bugs.

Since I've spent a year and a half trying to get this method to work
and am now, I think, almost there, I'm not particularly sanguine about
totally changing the approach.

This is very impressive, but it looks like we’re trying to lift too
much weight on the first lift. If we want to go anywhere near this, we
need to have very clear documentation about how and why it's like that.
I’m actually very sorry to say that because the review started very
well, much much better than most.

When I posted the group locking patch, it got criticized because it
didn't actually do anything useful by itself; similarly, the
pg_background patch was criticized for not being a large enough step
toward parallelism. So, this time, I posted something more
comprehensive. I don't think it's quite complete yet. I expect a
committable version of this patch to be maybe another 500-1000 lines
over what I have here right now -- I think it needs to do something
about heavyweight locking, and I expect that there are some unsafe
things that aren't quite prohibited yet. But the current patch is
only 2300 lines, which is not astonishingly large for a feature of
this magnitude; if anything, I'd say it's surprisingly small, due to a
year and a half of effort laying the necessary groundwork via long
series of preliminary commits. I'm not unwilling to divide this up
some more if we can agree on a way to do that that makes sense, but I
think we're nearing the point where we need to take the plunge and
say, look, this is version one of parallelism. Thunk.

In addition to the changes mentioned above, the attached version
prohibits a few more things (as suggested by Andres) and passes the
dsm_segment to the user-supplied entrypoint (as requested off-list by
Andres, because otherwise you can't set up additional shm_mq
structures).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

parallel-mode-v0.3.patch (text/x-patch; charset=US-ASCII), +2307 -56
#15Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#14)
Re: parallel mode and parallel contexts

On 6 January 2015 at 16:33, Robert Haas <robertmhaas@gmail.com> wrote:

These comments don’t have any explanation or justification

OK, I rewrote them. Hopefully it's better now.

Thanks for new version and answers.

* Transaction stuff strikes me as overcomplicated and error prone.
Given there is no explanation or justification for most of it, I’d be
inclined to question why it's there

Gosh, I was pretty pleased with how simple the transaction integration
turned out to be. Most of what's there right now is either (a) code
to copy state from the master to the parallel workers or (b) code to
throw errors if the workers try to do things that aren't safe. I suspect
there are a few things missing, but I don't see anything there that
looks unnecessary.

If you can explain it in more detail in comments and README, I may
agree. At present, I don't get it and it makes me nervous.

The comment says
"Only the top frame of the transaction state stack is copied to a parallel
worker"
but I'm not sure why.

Top meaning the current subxact or the main xact?
If main, why do we need XactTopTransactionId?

This is very impressive, but it looks like we’re trying to lift too
much weight on the first lift. If we want to go anywhere near this, we
need to have very clear documentation about how and why it's like that.
I’m actually very sorry to say that because the review started very
well, much much better than most.

When I posted the group locking patch, it got criticized because it
didn't actually do anything useful by itself; similarly, the
pg_background patch was criticized for not being a large enough step
toward parallelism. So, this time, I posted something more
comprehensive. I don't think it's quite complete yet. I expect a
committable version of this patch to be maybe another 500-1000 lines
over what I have here right now -- I think it needs to do something
about heavyweight locking, and I expect that there are some unsafe
things that aren't quite prohibited yet. But the current patch is
only 2300 lines, which is not astonishingly large for a feature of
this magnitude; if anything, I'd say it's surprisingly small, due to a
year and a half of effort laying the necessary groundwork via long
series of preliminary commits. I'm not unwilling to divide this up
some more if we can agree on a way to do that that makes sense, but I
think we're nearing the point where we need to take the plunge and
say, look, this is version one of parallelism. Thunk.

I want this also; the only debate is where to draw the line and please
don't see that as criticism.

I'm very happy it's so short, I agree it could be longer.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#15)
Re: parallel mode and parallel contexts

On Tue, Jan 6, 2015 at 3:04 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

If you can explain it in more detail in comments and README, I may
agree. At present, I don't get it and it makes me nervous.

The comment says
"Only the top frame of the transaction state stack is copied to a parallel
worker"
but I'm not sure why.

Top meaning the current subxact or the main xact?
If main, why do we need XactTopTransactionId?

Current subxact.

I initially thought of copying the entire TransactionStateData stack,
but Heikki suggested (via IM) that I do it this way instead. I
believe his concern was that it's never valid to commit or roll back
to a subtransaction that is not at the top of the stack, and if you
don't copy the stack, you avoid the risk of somehow ending up in that
state. Also, you avoid having to invent resource owners for
(sub)transactions that don't really exist in the current process. On
the other hand, you do end up with a few special cases that wouldn't
exist with the other approach. Still, I'm pretty happy to have taken
Heikki's advice: it was certainly simple to implement this way, plus
hopefully that way at least one person likes what I ended up with.
:-)

What else needs clarification?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#17Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#16)
Re: parallel mode and parallel contexts

On 6 January 2015 at 21:01, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jan 6, 2015 at 3:04 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

If you can explain it in more detail in comments and README, I may
agree. At present, I don't get it and it makes me nervous.

The comment says
"Only the top frame of the transaction state stack is copied to a parallel
worker"
but I'm not sure why.

Top meaning the current subxact or the main xact?
If main, why do we need XactTopTransactionId?

Current subxact.

TopTransactionStateData points to the top-level transaction data, but
XactTopTransactionId points to the current subxact.

So when you say "Only the top frame of the transaction state stack is
copied" you don't mean the top, you mean the bottom (the latest
subxact)? Which then becomes the top in the parallel worker? OK...

I initially thought of copying the entire TransactionStateData stack,
but Heikki suggested (via IM) that I do it this way instead. I
believe his concern was that it's never valid to commit or roll back
to a subtransaction that is not at the top of the stack, and if you
don't copy the stack, you avoid the risk of somehow ending up in that
state. Also, you avoid having to invent resource owners for
(sub)transactions that don't really exist in the current process. On
the other hand, you do end up with a few special cases that wouldn't
exist with the other approach. Still, I'm pretty happy to have taken
Heikki's advice: it was certainly simple to implement this way, plus
hopefully that way at least one person likes what I ended up with.
:-)

What else needs clarification?

Those comments really belong in a README, not the first visible
comment in xact.c

You need to start with the explanation that parallel workers have a
faked-up xact stack to make it easier to copy and manage. That is
valid because we never change xact state during a worker operation.

I get it now and agree, but please work some more on clarity of
README, comments and variable naming!

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#18Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Robert Haas (#14)
Re: parallel mode and parallel contexts

On 1/6/15, 10:33 AM, Robert Haas wrote:

Entrypints?

Already noted by Andres; fixed in the attached version.

Perhaps we only parallelize while drinking beer... ;)

CreateParallelContext(): Does it actually make sense to have nworkers=0? ISTM that's a bogus case. Also, since the number of workers will normally be determined dynamically by the planner, should that check be a regular conditional instead of just an Assert?

In LaunchParallelWorkers() the "Start Workers" comment states that we give up registering more workers if one fails to register, but there's nothing in the if condition to do that, and I don't see RegisterDynamicBackgroundWorker() doing it either. Is the comment just incorrect?

SerializeTransactionState(): instead of looping through the transaction stack to calculate nxids, couldn't we just set it to maxsize - sizeof(TransactionId) * 3? If we're looping a second time for safety a comment explaining that would be useful...

sequence.c: Is it safe to read a sequence value in a parallel worker if the cache_value is > 1?

This may be a dumb question, but for functions do we know that all pl's besides C and SQL use SPI? If not I think they could end up writing in a worker.

@@ -2968,7 +2969,8 @@ ProcessInterrupts(void)
  					 errmsg("canceling statement due to user request")));
  		}
  	}
-	/* If we get here, do nothing (probably, QueryCancelPending was reset) */
+	if (ParallelMessagePending)
+		HandleParallelMessageInterrupt(false);
ISTM it'd be good to leave that comment in place (after the if).

RestoreComboCIDState(): Assert(!comboCids || !comboHash): Shouldn't that be &&? AIUI both should always be either set or 0.

Typo: + elog(ERROR, "cannot update SecondarySnapshpt during a parallel operation");

The comment for RestoreSnapshot refers to set_transaction_snapshot, but that doesn't actually exist (or isn't referenced).
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


#19Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#17)
Re: parallel mode and parallel contexts

On 6 January 2015 at 21:37, Simon Riggs <simon@2ndquadrant.com> wrote:

I get it now and agree

Yes, very much.

Should we copy both the top-level frame and the current subxid?
Hot Standby links subxids directly to the top-level, so this would be similar.

If we copied both, we wouldn't need to special case the Get functions.
It would still be O(1).

but please work some more on clarity of
README, comments and variable naming!

Something other than "top" sounds good.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#20Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#17)
Re: parallel mode and parallel contexts

On Tue, Jan 6, 2015 at 4:37 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

So when you say "Only the top frame of the transaction state stack is
copied" you don't mean the top, you mean the bottom (the latest
subxact)? Which then becomes the top in the parallel worker? OK...

The item most recently added to the stack is properly called the top,
but I guess it's confusing in this case because the item on the bottom
of the stack is referred to as the TopTransaction. I'll see if I can
rephrase that.

Those comments really belong in a README, not the first visible
comment in xact.c

OK.

You need to start with the explanation that parallel workers have a
faked-up xact stack to make it easier to copy and manage. That is
valid because we never change xact state during a worker operation.

OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

