Parallel Sort
It would be great if one client session could take advantage of multiple CPU
cores. EnterpriseDB wishes to start the trek into this problem space for 9.4
by implementing parallel internal (i.e. not spilling to disk) sort. This
touches on a notable subset of the infrastructure components we'll need for
parallel general query. My intent is to map out the key design topics, hear
about critical topics I hadn't considered, and solicit feedback on the quality
of the high-level plan. Full designs for key pieces will come later.
* Worker Backends
A multi-process model, rather than a multi-threaded one, is best for introducing
parallelism to a PostgreSQL session. With threading, all in-memory state
would become shared by default; with processes, in-memory data structures
remain unshared until we explicitly select them for sharing. Some data
structures will require new sharing, but I expect unshared ones to remain the
typical case. I refer to these additional processes as worker backends. They
will have much in common with ordinary client backends, including a database
connection and a PGPROC. They will have no client socket; instead, they will
take direction from the client-connected backend, termed the master, via
shared memory.
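
To make the master/worker contract concrete, here is a rough sketch of the
kind of coordination area I have in mind; all of these names are hypothetical:

#include "postgres.h"
#include "storage/proc.h"

/* Commands the master can issue to a worker; purely illustrative. */
typedef enum WorkerCommand
{
    WORKER_CMD_IDLE,        /* nothing to do yet */
    WORKER_CMD_SORT_RUN,    /* sort the assigned range of tuples */
    WORKER_CMD_TERMINATE    /* task complete; worker should exit */
} WorkerCommand;

typedef struct WorkerSlot
{
    PGPROC     *proc;           /* the worker's PGPROC, once launched */
    WorkerCommand command;      /* written by the master, read by the worker */
    bool        done;           /* set by the worker when the command completes */
    int         sqlerrcode;     /* nonzero if the worker failed */
} WorkerSlot;

typedef struct ParallelTaskDesc
{
    PGPROC     *master;         /* the client-connected backend */
    int         nworkers;       /* number of slots that follow */
    WorkerSlot  slots[FLEXIBLE_ARRAY_MEMBER];
} ParallelTaskDesc;

The master would fill in the slots, wake the workers, and sleep until every
done flag is set; errors would propagate back through sqlerrcode.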
Each worker needs to make SnapshotNow visibility decisions coherent with the
master. For sorting, this allows us to look up comparison functions, even
when the current transaction created or modified those functions. This will
also be an essential building block for any parallelism project that consults
user tables. Implementing this means copying the subtransaction stack and the
combocid hash to each worker. For the sake of completeness, we should also
copy the global MVCC snapshot data (sorting probably won't care). It also
means forbidding, while a parallel task is in flight, operations that affect
the transaction state:
- CommandCounterIncrement()
- GetCurrentCommandId(true)
- AssignTransactionId()
- subtransaction-management functions
I don't intend to commit us to a design fundamentally antagonistic to, for
example, starting a subtransaction in a worker backend. That being said, I
think we could achieve a lot in the parallel query space before we would deem
it important to relax those restrictions.
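
As a sketch of how that prohibition might be enforced, each of the operations
listed above could begin with a check of this shape (the flag name is
hypothetical):

#include "postgres.h"

bool        parallel_task_in_flight = false;    /* set while workers run */

void
CommandCounterIncrement(void)
{
    if (parallel_task_in_flight)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                 errmsg("cannot change transaction state during a parallel operation")));

    /* ... existing CommandCounterIncrement() logic ... */
}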
Workers will copy all GUCs from the master. bttextcmp() needs lc_collate, and
we'll inevitably reach the point of needing broad GUC coverage as we make more
things run in parallel. Further GUC changes will be forbidden while a
parallel task is in flight. (More generally, anything that mutates
backend-local state without reference to shared state will either need to be
forbidden or converted to route through shared state.)
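
To illustrate, the master could flatten its GUC state into the shared region
as a sequence of NUL-terminated name/value pairs, and each worker would replay
them through the ordinary assignment path. A rough sketch, assuming
set_config_option() can be used this way:

#include "postgres.h"
#include "utils/guc.h"

/* Worker side: replay "name\0value\0" pairs captured by the master. */
static void
restore_guc_state(const char *buf, Size len)
{
    const char *p = buf;

    while (p < buf + len)
    {
        const char *name = p;
        const char *value = name + strlen(name) + 1;

        (void) set_config_option(name, value,
                                 PGC_INTERNAL, PGC_S_OVERRIDE,
                                 GUC_ACTION_SET, true, ERROR);
        p = value + strlen(value) + 1;
    }
}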
The heavyweight locking mechanism will need to be aware of the association
between the master and its workers. For sorting, workers will acquire
heavyweight locks on system catalogs only. It is almost enough for the
workers to behave as though they are independent of the master for locking
purposes. But an undetected deadlock would arise if the master holds an
AccessExclusiveLock on a system catalog a worker needs.
We can allocate a small amount of permanent shared memory for coordination
among a group of processes, but sorting will benefit from a region as large as
maintenance_work_mem. Expect on-demand memory sharing.
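
For illustration, the on-demand region could be built on POSIX shared memory,
roughly as below; a real patch would need a portability layer covering System V
and Windows as well. This is a sketch, not a proposal for the exact API:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create and map a shared region; workers would shm_open() the same name. */
static void *
create_sort_region(const char *name, size_t size)
{
    int     fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    void   *addr;

    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t) size) < 0)
    {
        close(fd);
        shm_unlink(name);
        return NULL;
    }
    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping survives the close */
    return (addr == MAP_FAILED) ? NULL : addr;
}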
* Identifying Parallel-Compatible Functions
Not all functions can reasonably run on a worker backend. We should not
presume that a VOLATILE function can tolerate the unstable execution order
imposed by parallelism, though a function like clock_timestamp() is perfectly
reasonable to run that way. STABLE does not have that problem, but neither
does it constitute a promise that the function implementation is compatible
with parallel execution. Consider xid_age(), which would need code changes to
operate correctly in parallel. IMMUTABLE almost guarantees enough; there may
come a day when all IMMUTABLE functions can be presumed parallel-safe. For
now, an IMMUTABLE function could cause trouble by starting a (read-only)
subtransaction. The bottom line is that parallel-compatibility needs to be
separate from volatility classes for the time being.
I'm not sure what the specific answer here should look like. Simply having a
CREATE FUNCTION ... PARALLEL_IS_FINE flag is not entirely satisfying, because
the rules are liable to loosen over time.
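
Whatever catalog representation we settle on, the executor-side test could be
as simple as the sketch below; the proparallel column is hypothetical:

#include "postgres.h"
#include "access/htup_details.h"
#include "catalog/pg_proc.h"
#include "utils/syscache.h"

/* Assume pg_proc gains a char column "proparallel"; 's' = safe in a worker. */
static bool
function_is_parallel_safe(Oid funcid)
{
    HeapTuple   tp;
    bool        safe;

    tp = SearchSysCache1(PROCOID, ObjectIdGetDatum(funcid));
    if (!HeapTupleIsValid(tp))
        elog(ERROR, "cache lookup failed for function %u", funcid);
    safe = (((Form_pg_proc) GETSTRUCT(tp))->proparallel == 's');
    ReleaseSysCache(tp);
    return safe;
}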
* Planner & Similar Issues
We should decide whether to actually sort in parallel based on the comparator
cost and the data size. The system currently has no information on comparator
cost: bt*cmp (and indeed almost all built-in functions) all have procost=1,
but bttextcmp is at least 1000x slower than btint4cmp. Let's improve the
procost estimates of all core B-tree and hash operators. This will have other
benefits, but we will need to be cognizant of the risk of upsetting setups
that have tuned cpu_operator_cost based on the present situation.
The choice of whether to parallelize can probably be made in a manner similar to
the choice to do an external sort: the planner guesses the outcome for costing
purposes, but the actual decision is made at execution time. The planner
would determine a tuple count cutoff at which parallelism becomes favorable,
and tuplesort would check that to establish its actual decision.
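
That is, tuplesort would make a check of roughly this shape before recruiting
workers; the field and helper names here are hypothetical:

/* Inside tuplesort.c, where Tuplesortstate's fields are visible. */
static bool
tuplesort_should_parallelize(Tuplesortstate *state)
{
    /* below the planner-supplied cutoff, serial sort wins */
    if (state->memtupcount < state->parallel_tuple_cutoff)
        return false;
    /* no benefit unless idle workers are actually available */
    if (available_parallel_workers() <= 0)
        return false;
    return true;
}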
Use of parallelism within one process has a distributed effect on the system,
similar to the use of work_mem. Add a PGC_USERSET GUC representing the number
of worker processes the current session is willing to use. It would default
to a small number, say zero or two. On a throughput-oriented system with high
concurrent query counts, the DBA would tend to set/leave it at zero in
postgresql.conf; parallelism can only help if the system has free resources.
Separately, add a GUC limiting the total number of workers across the system
at any one time.
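
In guc.c terms, the pair might look roughly like the following entries in
ConfigureNamesInt[]; the names and defaults are placeholders:

/* Hypothetical sketch of the two GUCs, in guc.c's table format. */
{
    {"session_parallel_workers", PGC_USERSET, RESOURCES,
        gettext_noop("Number of worker processes this session may use."),
        NULL
    },
    &session_parallel_workers,
    0, 0, MAX_BACKENDS,
    NULL, NULL, NULL
},
{
    {"max_total_parallel_workers", PGC_SIGHUP, RESOURCES,
        gettext_noop("Cluster-wide limit on concurrently running workers."),
        NULL
    },
    &max_total_parallel_workers,
    8, 0, MAX_BACKENDS,
    NULL, NULL, NULL
},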
Thanks,
nm
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
Hi,
Interesting! Need to think about most, but one piece immediately came to
mind:
On 2013-05-13 10:28:59 -0400, Noah Misch wrote:
> Each worker needs to make SnapshotNow visibility decisions coherent with the
> master. For sorting, this allows us to look up comparison functions, even
> when the current transaction created or modified those functions.
I don't really see how you can achieve that given how SnapshotNow
works. There's nothing protecting you against one backend seeing changes
made by another transaction while another doesn't see them. SnapshotNow
doesn't even guarantee consistency within a single backend during a
single scan...
If you are meaning the above to just apply to changes made by the local
"master" backend, sure I can see that.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Noah Misch <noah@leadboat.com> writes:
> Each worker needs to make SnapshotNow visibility decisions coherent with the
> master. For sorting, this allows us to look up comparison functions, even
> when the current transaction created or modified those functions. This will
> also be an essential building block for any parallelism project that consults
> user tables. Implementing this means copying the subtransaction stack and the
> combocid hash to each worker.
[ ... and GUC settings, and who knows what else ... ]
This approach seems to me to be likely to guarantee that the startup
overhead for any parallel sort is so large that only fantastically
enormous sorts will come out ahead.
I think you need to think in terms of restricting the problem space
enough so that the worker startup cost can be trimmed to something
reasonable. One obvious suggestion is to forbid the workers from
doing any database access of their own at all --- the parent would
have to do any required catalog lookups for sort functions etc.
before forking the children.
I think we should also seriously think about relying on fork() and
copy-on-write semantics to launch worker subprocesses, instead of
explicitly copying so much state over to them. Yes, this would
foreclose ever having parallel query on Windows, but that's okay
with me (hm, now where did I put my asbestos longjohns ...)
Both of these lines of thought suggest that the workers should *not*
be full-fledged backends.
regards, tom lane
On 2013-05-13 10:57:39 -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
>> Each worker needs to make SnapshotNow visibility decisions coherent with the
>> master. For sorting, this allows us to look up comparison functions, even
>> when the current transaction created or modified those functions. This will
>> also be an essential building block for any parallelism project that consults
>> user tables. Implementing this means copying the subtransaction stack and the
>> combocid hash to each worker.
> [ ... and GUC settings, and who knows what else ... ]
>
> This approach seems to me to be likely to guarantee that the startup
> overhead for any parallel sort is so large that only fantastically
> enormous sorts will come out ahead.
I think if this is the way to go - and I am not sure it is - we need to
use a worker pool whose members are (re)used every time someone needs to
do a sort. Which would be easier if backends could switch databases...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, May 13, 2013 at 10:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This approach seems to me to be likely to guarantee that the startup
> overhead for any parallel sort is so large that only fantastically
> enormous sorts will come out ahead.
>
> I think you need to think in terms of restricting the problem space
> enough so that the worker startup cost can be trimmed to something
> reasonable. One obvious suggestion is to forbid the workers from
> doing any database access of their own at all --- the parent would
> have to do any required catalog lookups for sort functions etc.
> before forking the children.
>
> I think we should also seriously think about relying on fork() and
> copy-on-write semantics to launch worker subprocesses, instead of
> explicitly copying so much state over to them. Yes, this would
> foreclose ever having parallel query on Windows, but that's okay
> with me (hm, now where did I put my asbestos longjohns ...)
>
> Both of these lines of thought suggest that the workers should *not*
> be full-fledged backends.
Eventually, PostgreSQL needs not only parallel sort, but a more
general parallel query facility. The goal here is not to design
something specific to parallel sort, but to provide a general
infrastructure for server-side parallelism. If we restrict ourselves
to a design where syscache lookups aren't possible from a worker
backend, I have trouble seeing how that's ever gonna work. That's a
very low-level facility that a lot of things rely on. Even if you
could make the shuttling of requests between master and slave
transparent to the backend code, it's cutting into the amount of
actual parallelizable stuff, and adding very significantly to the
overhead.
I don't see any reason to panic about the worker startup cost. I
don't know whether the stuff that Noah mentioned copying will take 10
microseconds or 100 milliseconds, but there are plenty of sorts that
take large numbers of seconds or minutes to happen, so there's still
plenty of opportunity for win there. By definition, the things that
you want to run in parallel are the ones that take a long time if you
don't run them in parallel. Now, of course, if we can reduce the cost
of starting new backends (e.g. by keeping them around from one
parallel operation to the next, or by starting them via fork), that
will expand the range of cases where parallelism is a win. But I
think we could win in plenty of interesting real-world cases even if
it took a full second to initialize each new worker, and surely it
won't be nearly that bad.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 13, 2013 at 04:39:01PM +0200, Andres Freund wrote:
> On 2013-05-13 10:28:59 -0400, Noah Misch wrote:
>> Each worker needs to make SnapshotNow visibility decisions coherent with the
>> master. For sorting, this allows us to look up comparison functions, even
>> when the current transaction created or modified those functions.
>
> I don't really see how you can achieve that given how SnapshotNow
> works. There's nothing protecting you against one backend seeing changes
> made by another transaction while another doesn't see them. SnapshotNow
> doesn't even guarantee consistency within a single backend during a
> single scan...
>
> If you are meaning the above to just apply to changes made by the local
> "master" backend, sure I can see that.
Yes; it only makes the workers consistent with the master to the extent that
the master is consistent with itself. However, your comment makes me see that
this may not be enough. For an object not protected by locks, if parallel
execution probes the syscache N times where the current code probes it only
once, we'll introduce new anomalies. I don't think this affects sorting in
particular, because we already make little effort to behave sanely when a
comparison function is redefined mid-sort. It seems likely that this will
need a better answer sooner or later as we move into the parallelism space.
Thanks,
nm
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
On 13 May 2013 15:28, Noah Misch <noah@leadboat.com> wrote:
> The heavyweight locking mechanism will need to be aware of the association
> between the master and its workers.
Not sure I can see why that would be true.
ISTM that the workers need to be restricted in various ways from a
full-functioned master. If the workers can do things that conflict
with the master and/or each other, we're going to have all kinds of
hurt.
The key is going to be deciding what the restrictions are and then
working out how to enforce them.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
2013/5/13 Noah Misch <noah@leadboat.com>
> * Planner & Similar Issues
>
> We should decide whether to actually sort in parallel based on the comparator
> cost and the data size. The system currently has no information on comparator
> cost: bt*cmp (and indeed almost all built-in functions) all have procost=1,
> but bttextcmp is at least 1000x slower than btint4cmp. Let's improve the
> procost estimates of all core B-tree and hash operators. This will have other
> benefits, but we will need to be cognizant of the risk of upsetting setups
> that have tuned cpu_operator_cost based on the present situation.
>
> The choice of whether to parallelize can probably be made in a manner similar
> to the choice to do an external sort: the planner guesses the outcome for
> costing purposes, but the actual decision is made at execution time. The
> planner would determine a tuple count cutoff at which parallelism becomes
> favorable, and tuplesort would check that to establish its actual decision.

This overlaps with a problem I have been thinking about: off-loading
CPU-bound workloads, which I partially tried to implement in the writable
foreign table feature.
Beyond sorting, I think it may be worthwhile to have the capability to push
heavy workloads (like sorts, aggregates, or complex target lists) out to
external computing resources.
However, I doubt whether the decision to parallelize should be made at
execution time rather than at plan stage. For example, given enough records
and a 10-core multiprocessor, the wise plan may be to load the data in
parallel across 10 processors, partial-sort on each of them individually,
then merge-sort the results. That needs a fundamentally different tree
structure from the traditional single-processor plan tree.
So it seems to me we should add an enhancement that allows injecting
special-purpose parallel processing nodes into the plan tree.
What is your opinion?
Thanks,
On 13 May 2013 15:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think you need to think in terms of restricting the problem space
+1
> One obvious suggestion is to forbid the workers from
> doing any database access of their own at all --- the parent would
> have to do any required catalog lookups for sort functions etc.
+1
> I think we should also seriously think about relying on fork() and
> copy-on-write semantics to launch worker subprocesses, instead of
> explicitly copying so much state over to them. Yes, this would
> foreclose ever having parallel query on Windows, but that's okay
> with me (hm, now where did I put my asbestos longjohns ...)
If we relied on some kind of inherited state, we could easily make the
mistake of relying on something that isn't actually being maintained
correctly in the worker. Luckily (?) that is exactly the stance we
need to make this work on Windows. Other than that, releasing on
Windows in a later release sounds sensible; otherwise we'll just delay
the development so much that it will still happen in the "later"
timeframe, just with the chance of an earlier release on Linux/BSD missed.
For example, the idea of managing separate subtransactions in each
worker sounds nuts. Impressive, if you're already thinking about
parallel DML that can self-recover halfway through a statement and
then continue processing, but that's a little advanced. The way to
think about this is as a 10-year journey, not as a single feature.
-1 for forking
> Both of these lines of thought suggest that the workers should *not*
> be full-fledged backends.
+1 to the idea of workers != masters
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, May 13, 2013 at 11:28 PM, Noah Misch <noah@leadboat.com> wrote:
> * Identifying Parallel-Compatible Functions
>
> Not all functions can reasonably run on a worker backend. We should not
> presume that a VOLATILE function can tolerate the unstable execution order
> imposed by parallelism, though a function like clock_timestamp() is perfectly
> reasonable to run that way. STABLE does not have that problem, but neither
> does it constitute a promise that the function implementation is compatible
> with parallel execution. Consider xid_age(), which would need code changes to
> operate correctly in parallel. IMMUTABLE almost guarantees enough; there may
> come a day when all IMMUTABLE functions can be presumed parallel-safe. For
> now, an IMMUTABLE function could cause trouble by starting a (read-only)
> subtransaction. The bottom line is that parallel-compatibility needs to be
> separate from volatility classes for the time being.
I am not sure this problem is limited to functions; it extends to all the
expressions and clauses of queries that could be shipped and evaluated on
the worker backends when fetching tuples to accelerate a parallel sort.
Imagine for example a LIMIT clause that worker backends could use to limit
the number of tuples to sort for the final result.
In some ways, Postgres-XC has faced (and is still facing) similar
challenges, and they have been partially solved.
> I'm not sure what the specific answer here should look like. Simply having a
> CREATE FUNCTION ... PARALLEL_IS_FINE flag is not entirely satisfying, because
> the rules are liable to loosen over time.
Having a flag would be enough to control parallelism, but couldn't we also
determine whether the execution of a function can be shipped safely to a
worker based on its volatility alone? Immutable functions are presumably
safe, as they do not modify the database state and always give the same
result; volatile and stable functions are definitely not safe.
For such reasons, it would be better to keep things simple and rely on
simple rules to determine whether a given expression can be executed safely
on a worker backend.
--
Michael
On Tue, May 14, 2013 at 12:51 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>> I'm not sure what the specific answer here should look like. Simply having a
>> CREATE FUNCTION ... PARALLEL_IS_FINE flag is not entirely satisfying, because
>> the rules are liable to loosen over time.
>
> Having a flag would be enough to control parallelism, but couldn't we also
> determine whether the execution of a function can be shipped safely to a
> worker based on its volatility alone? Immutable functions are presumably
> safe, as they do not modify the database state and always give the same
> result; volatile and stable functions are definitely not safe.
> For such reasons, it would be better to keep things simple and rely on
> simple rules to determine whether a given expression can be executed safely
> on a worker backend.
In the part of the text you didn't quote, Noah explained why not all
immutable functions are parallel-safe.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 13, 2013 at 07:55:16PM +0100, Simon Riggs wrote:
> On 13 May 2013 15:28, Noah Misch <noah@leadboat.com> wrote:
>> The heavyweight locking mechanism will need to be aware of the association
>> between the master and its workers.
>
> Not sure I can see why that would be true.
>
> ISTM that the workers need to be restricted in various ways from a
> full-functioned master. If the workers can do things that conflict
> with the master and/or each other, we're going to have all kinds of
> hurt.
>
> The key is going to be deciding what the restrictions are and then
> working out how to enforce them.
Yes; the specific lock manager needs will depend on what we permit workers to do.
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
On Mon, May 13, 2013 at 09:52:43PM +0200, Kohei KaiGai wrote:
> 2013/5/13 Noah Misch <noah@leadboat.com>
>> The choice of whether to parallelize can probably be made in a manner similar
>> to the choice to do an external sort: the planner guesses the outcome for
>> costing purposes, but the actual decision is made at execution time. The
>> planner would determine a tuple count cutoff at which parallelism becomes
>> favorable, and tuplesort would check that to establish its actual decision.
>
> This overlaps with a problem I have been thinking about: off-loading
> CPU-bound workloads, which I partially tried to implement in the writable
> foreign table feature.
> Beyond sorting, I think it may be worthwhile to have the capability to push
> heavy workloads (like sorts, aggregates, or complex target lists) out to
> external computing resources.
> However, I doubt whether the decision to parallelize should be made at
> execution time rather than at plan stage. For example, given enough records
> and a 10-core multiprocessor, the wise plan may be to load the data in
> parallel across 10 processors, partial-sort on each of them individually,
> then merge-sort the results. That needs a fundamentally different tree
> structure from the traditional single-processor plan tree.
That's taking a few steps more in the direction of parallel general query; at
some point, the planner would definitely become parallel-aware. For the
narrower topic of parallel sort, I don't think it's necessary. The node tree
supplying the sort can't run in parallel (yet), and the node pulling from the
sort won't receive the first tuple until the sort is complete. For the time
being, the planner just needs to know enough about the sort node's projected
use of parallelism to estimate cost.
> So it seems to me we should add an enhancement that allows injecting
> special-purpose parallel processing nodes into the plan tree.
> What is your opinion?
I'm not picturing how, specifically, this new node or class of nodes would be
used. Could you elaborate?
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
On Tue, May 14, 2013 at 01:51:42PM +0900, Michael Paquier wrote:
> On Mon, May 13, 2013 at 11:28 PM, Noah Misch <noah@leadboat.com> wrote:
>> * Identifying Parallel-Compatible Functions
>> [...]
>> The bottom line is that parallel-compatibility needs to be
>> separate from volatility classes for the time being.
>
> I am not sure this problem is limited to functions; it extends to all the
> expressions and clauses of queries that could be shipped and evaluated on
> the worker backends when fetching tuples to accelerate a parallel sort.
> Imagine for example a LIMIT clause that worker backends could use to limit
> the number of tuples to sort for the final result.
It's true that the same considerations apply to other plan tree constructs;
however, every such construct is known at build time, so we can study each one
and decide how it fits with parallelism. Since functions are user-definable,
it's preferable to reason about classes of functions.
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
On Tue, May 14, 2013 at 11:50 AM, Noah Misch <noah@leadboat.com> wrote:
> On Mon, May 13, 2013 at 09:52:43PM +0200, Kohei KaiGai wrote:
>> 2013/5/13 Noah Misch <noah@leadboat.com>
>>> The choice of whether to parallelize can probably be made in a manner
>>> similar to the choice to do an external sort: the planner guesses the
>>> outcome for costing purposes, but the actual decision is made at
>>> execution time.
>>
>> [...]
>> However, I doubt whether the decision to parallelize should be made at
>> execution time rather than at plan stage. For example, given enough
>> records and a 10-core multiprocessor, the wise plan may be to load the
>> data in parallel across 10 processors, partial-sort on each of them
>> individually, then merge-sort the results. That needs a fundamentally
>> different tree structure from the traditional single-processor plan tree.
>
> That's taking a few steps more in the direction of parallel general query; at
> some point, the planner would definitely become parallel-aware. For the
> narrower topic of parallel sort, I don't think it's necessary. The node tree
> supplying the sort can't run in parallel (yet), and the node pulling from the
> sort won't receive the first tuple until the sort is complete. For the time
> being, the planner just needs to know enough about the sort node's projected
> use of parallelism to estimate cost.
You know what would be a low-hanging fruit that I've been thinking
would benefit many of my own queries?
"Parallel" sequential scan nodes. Even if there's no real parallelism
involved, when a query has to scan the same table at multiple nodes,
and the table is big, it would be worth running the scans in parallel
to transform them into synchronized scans.
I have absolutely no idea how this would work easily without forked
workers, because the scans might be buried in more complex execution
trees. But still, it's worth considering that parallelizing may bring
benefits beyond extra core usage.
If execution nodes could be paused at arbitrary points, a "parallel
scan" node could pause one branch that has consumed the circular
buffer, letting other branches consume their part, and thus
"parallelize" branch execution. But this would be perhaps more
complex than simply forking.
On Tue, May 14, 2013 at 11:59 PM, Noah Misch <noah@leadboat.com> wrote:
> On Tue, May 14, 2013 at 01:51:42PM +0900, Michael Paquier wrote:
>> On Mon, May 13, 2013 at 11:28 PM, Noah Misch <noah@leadboat.com> wrote:
>>> * Identifying Parallel-Compatible Functions
>>> [...]
>>
>> I am not sure this problem is limited to functions; it extends to all the
>> expressions and clauses of queries that could be shipped and evaluated on
>> the worker backends when fetching tuples to accelerate a parallel sort.
>> Imagine for example a LIMIT clause that worker backends could use to limit
>> the number of tuples to sort for the final result.
>
> It's true that the same considerations apply to other plan tree constructs;
> however, every such construct is known at build time, so we can study each one
> and decide how it fits with parallelism.
The concept of clause parallelism for worker backends is close to the
concept of clause shippability introduced in Postgres-XC. In the case of
XC, the equivalent of the master backend is a backend located on a node
called a Coordinator that merges and organizes results fetched in parallel
from the remote nodes where the data scans occur (nodes called Datanodes).
The backends used for tuple scans across the Datanodes share the same data
visibility, as they use the same snapshot and transaction ID as the backend
on the Coordinator. This differs from the parallelism proposed here in that
there is no notion of importing snapshots into the worker backends.
However, the code in the XC planner used for clause-shippability evaluation
is definitely worth looking at, considering the many similarities it shares
with parallelism when evaluating whether a given clause can be executed on a
worker backend or not. It would be a waste to implement the same thing twice
if code is already available.
> Since functions are user-definable, it's preferable to reason about
> classes of functions.
Yes. You are right.
--
Michael
On 05/13/2013 07:10 PM, Robert Haas wrote:
> On Mon, May 13, 2013 at 10:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> [...]
>> Both of these lines of thought suggest that the workers should *not*
>> be full-fledged backends.
>
> Eventually, PostgreSQL needs not only parallel sort, but a more
> general parallel query facility. The goal here is not to design
> something specific to parallel sort, but to provide a general
> infrastructure for server-side parallelism. If we restrict ourselves
> to a design where syscache lookups aren't possible from a worker
> backend, I have trouble seeing how that's ever gonna work. That's a
> very low-level facility that a lot of things rely on. Even if you
> could make the shuttling of requests between master and slave
> transparent to the backend code, it's cutting into the amount of
> actual parallelizable stuff, and adding very significantly to the
> overhead.
Has anybody looked into making syscache MVCC-compliant?
A lot of complexity would be avoided if there were some efficient
way to have the same MVCC power in syscache as there is in the rest
of the system.
So instead of solving the cache transactional coherency problems
separately in each user of syscache, why not bring the syscache up to
the level of the rest of Postgres?
> I don't see any reason to panic about the worker startup cost. I
> don't know whether the stuff that Noah mentioned copying will take 10
> microseconds or 100 milliseconds, but there are plenty of sorts that
> take large numbers of seconds or minutes to happen, so there's still
> plenty of opportunity for win there. By definition, the things that
> you want to run in parallel are the ones that take a long time if you
> don't run them in parallel. Now, of course, if we can reduce the cost
> of starting new backends (e.g. by keeping them around from one
> parallel operation to the next, or by starting them via fork), that
> will expand the range of cases where parallelism is a win. But I
> think we could win in plenty of interesting real-world cases even if
> it took a full second to initialize each new worker, and surely it
> won't be nearly that bad.
+1 to there being lots of use cases for high-startup-cost parallelism.
There are also lots of non-parallel query plans with high startup costs,
and the PostgreSQL optimizer already deals with them quite well.
Hannu
Hannu Krosing <hannu@krosing.net> writes:
> Has anybody looked into making syscache MVCC-compliant?
This is the wrong statement of the question.
The right statement is "how would you like backend A to be updating
table T in compliance with a view of table T's schema that is obsolete
because of backend B's already-committed DDL changes?" For example,
ignoring a just-committed CHECK constraint because it's not visible
to your transaction's snapshot.
The point of SnapshotNow catalog lookups is to be sure that we see the
current definition of a table whether or not we're in a transaction
whose snapshot precedes the last DDL change to that table. There might
be some other way to deal with that consistency problem, but just
changing syscache's behavior is not going to make things better.
regards, tom lane
On 05/15/2013 07:30 AM, Tom Lane wrote:
> Hannu Krosing <hannu@krosing.net> writes:
>> Has anybody looked into making syscache MVCC-compliant?
>
> This is the wrong statement of the question.
>
> The right statement is "how would you like backend A to be updating
> table T in compliance with a view of table T's schema that is obsolete
> because of backend B's already-committed DDL changes?"
Could we not treat it the same way we treat updated rows in
serializable transaction mode? That is, roll back and let the
client retry if it so wishes?
> For example, ignoring a just-committed CHECK constraint because it's not
> visible to your transaction's snapshot.
Is there anything in the SQL standard that tells us to handle schema changes
in parallel with DML in any way other than by rollback?
I agree it is nice to have a way to carry on in this case, but there is only
so much you can sensibly do. For example, I can see no good way to handle a
parallel DROP COLUMN, especially if you are trying to update that same
column based on your old snapshot.
> The point of SnapshotNow catalog lookups is to be sure that we see the
> current definition of a table whether or not we're in a transaction
> whose snapshot precedes the last DDL change to that table.
Can't we just detect that "there have been changes" and roll back if trying
any DML on changed tables? It does not seem like something that happens
too often.
> There might
> be some other way to deal with that consistency problem, but just
> changing syscache's behavior is not going to make things better.
I was targeting mainly the read-only case of querying data from a changed
table. I have not checked how SnapshotNow catalog lookups handle this, but
I guess it is not trivial when there have been changes to the catalogs since
the active snapshot was taken.
--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
On Monday, May 13, 2013 7:59 PM Noah Misch wrote:
> It would be great if one client session could take advantage of multiple CPU
> cores. EnterpriseDB wishes to start the trek into this problem space for 9.4
> by implementing parallel internal (i.e. not spilling to disk) sort.
> [...]
>
> We can allocate a small amount of permanent shared memory for coordination
> among a group of processes, but sorting will benefit from a region as large as
> maintenance_work_mem. Expect on-demand memory sharing.
Will the shared memory used for coordinating tuples between the master and
workers be fixed in size, or will it vary depending on the size of the tuples
to be sorted or the number of workers involved?
If it varies, then the system can sometimes encounter a situation where the
required memory is not available, in which case it would have to revert to
serial sorting.
> * Identifying Parallel-Compatible Functions
> [...]
>
> * Planner & Similar Issues
> [...]
> Use of parallelism within one process has a distributed effect on the system,
> similar to the use of work_mem. Add a PGC_USERSET GUC representing the number
> of worker processes the current session is willing to use. It would default
> to a small number, say zero or two. On a throughput-oriented system with high
> concurrent query counts, the DBA would tend to set/leave it at zero in
> postgresql.conf; parallelism can only help if the system has free resources.
> Separately, add a GUC limiting the total number of workers across the system
> at any one time.
How will the parallel sorting tasks be divided and assigned to each worker?
With Regards,
Amit Kapila.