Parallel Queries and PostGIS

Started by Paul Ramsey · almost 10 years ago · 15 messages
#1 Paul Ramsey
pramsey@cleverelephant.ca

I spent some time over the weekend trying out the different modes of
parallel query (seq scan, aggregate, join) in combination with PostGIS
and have written up the results here:

http://blog.cleverelephant.ca/2016/03/parallel-postgis.html

The TL;DR is basically:

* With some adjustments to function COST, both parallel sequential scan
and parallel aggregation deliver very good parallel performance
results.
* The cost adjustments for sequential scan and aggregation are not
consistent in magnitude.
* Parallel join does not seem to work for PostGIS indexes yet, but
perhaps there is some magic to learn from PostgreSQL core on that.

The two findings at the end are ones that need input from parallel
query masters...

We recognize we'll have to adjust costs so that our particular use
case (a very CPU-intensive calculation per function call) is planned
better, but it seems like different query modes interpret costs in
order-of-magnitude different ways when building plans.
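
The sort of adjustment in question is just a one-liner per function,
something like this (the function and value here are purely
illustrative, not the settings from the post):

  -- Default function COST is 1; raising it for CPU-heavy functions
  -- makes parallel plans look comparatively cheaper.
  ALTER FUNCTION ST_Area(geometry) COST 100;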

Parallel join would be a huge win, so some help/pointers on figuring
out why it's not coming into play when our gist operators are in
effect would be helpful.
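
The shape of the join in question is the usual PostGIS spatial join,
roughly like the following (table and column names are placeholders;
ST_Intersects() is a SQL wrapper that inlines to the && index operator
plus _ST_Intersects()):

  SELECT Count(*)
    FROM pd
    JOIN pts
      ON ST_Intersects(pd.geom, pts.geom);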

Happy Easter to you all,
P


#2 Stephen Frost
sfrost@snowman.net
In reply to: Paul Ramsey (#1)
Re: Parallel Queries and PostGIS

Paul,

* Paul Ramsey (pramsey@cleverelephant.ca) wrote:

> I spent some time over the weekend trying out the different modes of
> parallel query (seq scan, aggregate, join) in combination with PostGIS
> and have written up the results here:
>
> http://blog.cleverelephant.ca/2016/03/parallel-postgis.html

Neat!

Regarding aggregate parallelism and the cascaded union approach (though
I imagine this applies in other cases as well), it seems like having a
"final-per-worker" function for aggregates would be useful.

Without actually looking at the code at all, it seems like that wouldn't
be terribly difficult to add.

Would you agree that it'd be helpful to have for making the st_union()
work better in parallel?

Though I do wonder if you would end up wanting to have a different
final() function in that case..

Thanks!

Stephen

#3 Paul Ramsey
pramsey@cleverelephant.ca
In reply to: Stephen Frost (#2)
Re: Parallel Queries and PostGIS

On Mon, Mar 28, 2016 at 9:45 AM, Stephen Frost <sfrost@snowman.net> wrote:

> Paul,
>
> * Paul Ramsey (pramsey@cleverelephant.ca) wrote:
>
>> I spent some time over the weekend trying out the different modes of
>> parallel query (seq scan, aggregate, join) in combination with PostGIS
>> and have written up the results here:
>>
>> http://blog.cleverelephant.ca/2016/03/parallel-postgis.html
>
> Neat!
>
> Regarding aggregate parallelism and the cascaded union approach, though
> I imagine in other cases as well, it seems like having a
> "final-per-worker" function for aggregates would be useful.
>
> Without actually looking at the code at all, it seems like that wouldn't
> be terribly difficult to add.
>
> Would you agree that it'd be helpful to have for making the st_union()
> work better in parallel?

For our particular situation w/ ST_Union, yes, it would be ideal to be
able to run a worker-side combine function as well as the master-side
one. Although the cascaded union would be less effective spread out
over N nodes, doing it only once per worker, rather than every N
records, would minimize the loss of effectiveness.
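
For reference, the 9.6-style declaration we would be filling in looks
roughly like this (the aggregate and function names are placeholders,
not the actual PostGIS definitions):

  CREATE AGGREGATE st_union_sketch (geometry) (
      SFUNC        = union_transfn,     -- accumulate geometries per worker
      STYPE        = internal,
      COMBINEFUNC  = union_combinefn,   -- merge two worker states
      FINALFUNC    = union_finalfn,     -- cascaded union, once on the leader
      SERIALFUNC   = union_serialfn,    -- internal state to bytea for transfer
      DESERIALFUNC = union_deserialfn,
      PARALLEL     = SAFE
  );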

P


#4 Paul Ramsey
pramsey@cleverelephant.ca
In reply to: Paul Ramsey (#1)
Re: Parallel Queries and PostGIS

On Mon, Mar 28, 2016 at 9:18 AM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:

> Parallel join would be a huge win, so some help/pointers on figuring
> out why it's not coming into play when our gist operators are in
> effect would be helpful.

Robert, do you have any pointers on what I should look for to figure
out why the parallel join code doesn't fire if I add a GIST operator
to my join condition?

Thanks,

P


#5 Robert Haas
robertmhaas@gmail.com
In reply to: Paul Ramsey (#1)
Re: Parallel Queries and PostGIS

On Mon, Mar 28, 2016 at 12:18 PM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:

> I spent some time over the weekend trying out the different modes of
> parallel query (seq scan, aggregate, join) in combination with PostGIS
> and have written up the results here:
>
> http://blog.cleverelephant.ca/2016/03/parallel-postgis.html
>
> The TL;DR is basically:
>
> * With some adjustments to function COST, both parallel sequential scan
> and parallel aggregation deliver very good parallel performance
> results.
> * The cost adjustments for sequential scan and aggregation are not
> consistent in magnitude.
> * Parallel join does not seem to work for PostGIS indexes yet, but
> perhaps there is some magic to learn from PostgreSQL core on that.
>
> The two findings at the end are ones that need input from parallel
> query masters...
>
> We recognize we'll have to adjust costs so that our particular use
> case (a very CPU-intensive calculation per function call) is planned
> better, but it seems like different query modes interpret costs in
> order-of-magnitude different ways when building plans.
>
> Parallel join would be a huge win, so some help/pointers on figuring
> out why it's not coming into play when our gist operators are in
> effect would be helpful.

First, I beg to differ with this statement: "Some of the execution
results output are wrong! They say that only 1844 rows were removed by
the filter, but in fact 7376 were (as we can confirm by running the
queries without the EXPLAIN ANALYZE). This is a known limitation,
reporting on the results of only one parallel worker, which (should)
maybe, hopefully be fixed before 9.6 comes out." The point is that
line has loops=4, so as in any other case where loops>1, you're seeing
the number of rows divided by the number of loops (7376 / 4 = 1844). It
is the *average* number of rows that were processed by each loop - one
loop per worker, in this case.

I am personally of the opinion that showing rowcounts divided by loops
instead of total rowcounts is rather stupid, and that we should change
it regardless. But it's not parallel query's fault, and changing it
would affect the output of every EXPLAIN ANALYZE involving a nested
loop, probably confusing a lot of people until they figured out what
we'd changed, after which - I *think* they'd realize that they
actually liked the new way much better.

Now, on to your actual question:

I have no idea why the cost adjustments that you need are different
for the scan case and the aggregate case. That does seem problematic,
but I just don't know why it's happening.

On the join case, I wonder if it's possible that _st_intersects is not
marked parallel-safe? If that's not the problem, I don't have a
second guess, but the thing to do would be to figure out whether
consider_parallel is false for the RelOptInfo corresponding to either
of pd and pts, or whether it's true for both but false for the
joinrel's RelOptInfo, or whether it's true for all three of them but
you don't get the desired path anyway.
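
A quick way to eyeball those parallel-safety markings from SQL (function
names as you describe them) is something like:

  SELECT oid::regprocedure AS function, proparallel
    FROM pg_proc
   WHERE proname IN ('_st_intersects', 'geometry_overlaps');
  -- proparallel: 's' = safe, 'r' = restricted, 'u' = unsafe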

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#6 Paul Ramsey
pramsey@cleverelephant.ca
In reply to: Robert Haas (#5)
Re: Parallel Queries and PostGIS

> First, I beg to differ with this statement: "Some of the execution
> results output are wrong! ...." The point is that
> line has loops=4, so as in any other case where loops>1, you're seeing
> the number of rows divided by the number of loops. It is the
> *average* number of rows that were processed by each loop - one loop
> per worker, in this case.

Thanks for the explanation; let my reaction be a guide to what the
other unwashed will think :)

> Now, on to your actual question:
>
> I have no idea why the cost adjustments that you need are different
> for the scan case and the aggregate case. That does seem problematic,
> but I just don't know why it's happening.

What might be a good way to debug it? Is there a piece of code I can
look at to try and figure out the contribution of COST in either case?

> On the join case, I wonder if it's possible that _st_intersects is not
> marked parallel-safe? If that's not the problem, I don't have a
> second guess, but the thing to do would be to figure out whether
> consider_parallel is false for the RelOptInfo corresponding to either
> of pd and pts, or whether it's true for both but false for the
> joinrel's RelOptInfo, or whether it's true for all three of them but
> you don't get the desired path anyway.

_st_intersects is definitely marked parallel safe, and in fact will
generate a parallel plan if used alone (without the operator though,
it's impossibly slow). It's the && operator that is the issue... and I
just noticed that the PROCEDURE bound to the && operator
(geometry_overlaps) is *not* marked parallel safe: could be the
problem?
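
If that is it, the fix should be a one-line marking change along these
lines (modulo the exact signature):

  ALTER FUNCTION geometry_overlaps(geometry, geometry) PARALLEL SAFE;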

Thanks,

P


#7 Paul Ramsey
pramsey@cleverelephant.ca
In reply to: Paul Ramsey (#6)
Re: Parallel Queries and PostGIS

On Tue, Mar 29, 2016 at 12:48 PM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:

>> On the join case, I wonder if it's possible that _st_intersects is not
>> marked parallel-safe? If that's not the problem, I don't have a
>> second guess, but the thing to do would be to figure out whether
>> consider_parallel is false for the RelOptInfo corresponding to either
>> of pd and pts, or whether it's true for both but false for the
>> joinrel's RelOptInfo, or whether it's true for all three of them but
>> you don't get the desired path anyway.
>
> _st_intersects is definitely marked parallel safe, and in fact will
> generate a parallel plan if used alone (without the operator though,
> it's impossibly slow). It's the && operator that is the issue... and I
> just noticed that the PROCEDURE bound to the && operator
> (geometry_overlaps) is *not* marked parallel safe: could be the
> problem?

Asked and answered: marking the geometry_overlaps as parallel safe
gets me a parallel plan! Now to play with costs and see how it behaves
when force_parallel_mode is not set.

P.

> Thanks,
>
> P


#8 Robert Haas
robertmhaas@gmail.com
In reply to: Paul Ramsey (#6)
Re: Parallel Queries and PostGIS

On Tue, Mar 29, 2016 at 3:48 PM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:

>> I have no idea why the cost adjustments that you need are different
>> for the scan case and the aggregate case. That does seem problematic,
>> but I just don't know why it's happening.
>
> What might be a good way to debug it? Is there a piece of code I can
> look at to try and figure out the contribution of COST in either case?

Well, the cost calculations are mostly in costsize.c, but I dunno how
much that helps. Maybe it would help if you posted some EXPLAIN
ANALYZE output for the different cases, with and without parallelism?

One thing I noticed about this output (from your blog)...

Finalize Aggregate
  (cost=16536.53..16536.79 rows=1 width=8)
  (actual time=2263.638..2263.639 rows=1 loops=1)
  ->  Gather
        (cost=16461.22..16461.53 rows=3 width=32)
        (actual time=754.309..757.204 rows=4 loops=1)
        Number of Workers: 3
        ->  Partial Aggregate
              (cost=15461.22..15461.23 rows=1 width=32)
              (actual time=676.738..676.739 rows=1 loops=4)
              ->  Parallel Seq Scan on pd
                    (cost=0.00..13856.38 rows=64 width=2311)
                    (actual time=3.009..27.321 rows=42 loops=4)
                    Filter: (fed_num = 47005)
                    Rows Removed by Filter: 17341
Planning time: 0.219 ms
Execution time: 2264.684 ms

...is that the finalize aggregate phase is estimated to be very cheap,
but it's actually wicked expensive. We get the results from the
workers in only 750 ms, but it takes another second and a half to
aggregate those 4 rows???

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#9 Paul Ramsey
pramsey@cleverelephant.ca
In reply to: Robert Haas (#8)
Re: Parallel Queries and PostGIS

On Tue, Mar 29, 2016 at 1:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> On Tue, Mar 29, 2016 at 3:48 PM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:
>
>>> I have no idea why the cost adjustments that you need are different
>>> for the scan case and the aggregate case. That does seem problematic,
>>> but I just don't know why it's happening.
>>
>> What might be a good way to debug it? Is there a piece of code I can
>> look at to try and figure out the contribution of COST in either case?
>
> Well, the cost calculations are mostly in costsize.c, but I dunno how
> much that helps. Maybe it would help if you posted some EXPLAIN
> ANALYZE output for the different cases, with and without parallelism?
>
> One thing I noticed about this output (from your blog)...
>
> Finalize Aggregate
> (cost=16536.53..16536.79 rows=1 width=8)
> (actual time=2263.638..2263.639 rows=1 loops=1)
> -> Gather
> (cost=16461.22..16461.53 rows=3 width=32)
> (actual time=754.309..757.204 rows=4 loops=1)
> Number of Workers: 3
> -> Partial Aggregate
> (cost=15461.22..15461.23 rows=1 width=32)
> (actual time=676.738..676.739 rows=1 loops=4)
> -> Parallel Seq Scan on pd
> (cost=0.00..13856.38 rows=64 width=2311)
> (actual time=3.009..27.321 rows=42 loops=4)
> Filter: (fed_num = 47005)
> Rows Removed by Filter: 17341
> Planning time: 0.219 ms
> Execution time: 2264.684 ms
>
> ...is that the finalize aggregate phase is estimated to be very cheap,
> but it's actually wicked expensive. We get the results from the
> workers in only 750 ms, but it takes another second and a half to
> aggregate those 4 rows???

This is probably a vivid example of the bad behaviour of the naive
union approach. If we have worker states 1,2,3,4 and we go

combine(combine(combine(1,2),3),4)

then we get something like a worst-case complexity situation, where we
three times union an increasingly complex object on the left with a
simpler object on the right. Also, if the objects went into the
transition functions in relatively non-spatially-correlated order, the
polygons coming out of the transition functions could be quite complex,
and each merge would only add complexity to the output until the final
merge, which melts away all the remaining internal boundaries.

I'm surprised it's quite so awful at the end though, and less awful in
the worker stage... how do the workers end up getting rows to work on?
1,2,3,4,1,2,3,4,1,2,3,4? or 1,1,1,2,2,2,3,3,3,4,4,4? The former could
result in optimally inefficient unions, given a spatially correlated
input (surprisingly common in load-once GIS tables).

P.

> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


#10 Paul Ramsey
pramsey@cleverelephant.ca
In reply to: Paul Ramsey (#7)
Re: Parallel Queries and PostGIS

On Tue, Mar 29, 2016 at 12:51 PM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:

> On Tue, Mar 29, 2016 at 12:48 PM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:
>
>>> On the join case, I wonder if it's possible that _st_intersects is not
>>> marked parallel-safe? If that's not the problem, I don't have a
>>> second guess, but the thing to do would be to figure out whether
>>> consider_parallel is false for the RelOptInfo corresponding to either
>>> of pd and pts, or whether it's true for both but false for the
>>> joinrel's RelOptInfo, or whether it's true for all three of them but
>>> you don't get the desired path anyway.
>>
>> _st_intersects is definitely marked parallel safe, and in fact will
>> generate a parallel plan if used alone (without the operator though,
>> it's impossibly slow). It's the && operator that is the issue... and I
>> just noticed that the PROCEDURE bound to the && operator
>> (geometry_overlaps) is *not* marked parallel safe: could be the
>> problem?
>
> Asked and answered: marking the geometry_overlaps as parallel safe
> gets me a parallel plan! Now to play with costs and see how it behaves
> when force_parallel_mode is not set.

For the record I can get a non-forced parallel join plan, *only* if I
reduce the parallel_join_cost by a factor of 10, from 0.1 to 0.01.

http://blog.cleverelephant.ca/2016/03/parallel-postgis-joins.html

This seems non-optimal. No amount of cranking up the underlying
function COST seems to change this, perhaps because the join cost is
entirely based on the number of expected tuples in the join relation?

In general it seems like function COST values have been considered a
relatively unimportant input to planning in the past, but with parallel
processing it seems like they are now much more determinative of what
makes a good plan.

P.


#11 Amit Kapila
amit.kapila16@gmail.com
In reply to: Paul Ramsey (#10)
Re: Parallel Queries and PostGIS

On Fri, Apr 1, 2016 at 12:49 AM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:

> On Tue, Mar 29, 2016 at 12:51 PM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:
>
>> On Tue, Mar 29, 2016 at 12:48 PM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:
>>
>>>> On the join case, I wonder if it's possible that _st_intersects is not
>>>> marked parallel-safe? If that's not the problem, I don't have a
>>>> second guess, but the thing to do would be to figure out whether
>>>> consider_parallel is false for the RelOptInfo corresponding to either
>>>> of pd and pts, or whether it's true for both but false for the
>>>> joinrel's RelOptInfo, or whether it's true for all three of them but
>>>> you don't get the desired path anyway.
>>>
>>> _st_intersects is definitely marked parallel safe, and in fact will
>>> generate a parallel plan if used alone (without the operator though,
>>> it's impossibly slow). It's the && operator that is the issue... and I
>>> just noticed that the PROCEDURE bound to the && operator
>>> (geometry_overlaps) is *not* marked parallel safe: could be the
>>> problem?
>>
>> Asked and answered: marking the geometry_overlaps as parallel safe
>> gets me a parallel plan! Now to play with costs and see how it behaves
>> when force_parallel_mode is not set.
>
> For the record I can get a non-forced parallel join plan, *only* if I
> reduce the parallel_join_cost by a factor of 10, from 0.1 to 0.01.

I think here you mean parallel_tuple_cost.
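
That is, the comparison presumably being made is between the default and
the reduced setting (0.1 is the 9.6 default):

  SHOW parallel_tuple_cost;        -- 0.1 by default
  SET parallel_tuple_cost = 0.01;  -- the reduction described above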

> http://blog.cleverelephant.ca/2016/03/parallel-postgis-joins.html
>
> This seems non-optimal. No amount of cranking up the underlying
> function COST seems to change this, perhaps because the join cost is
> entirely based on the number of expected tuples in the join relation?

Is the function cost not being considered when it is given as a join
clause, or are you saying that it is not considered for any parallel
plan in general? I think it should be considered when given as a clause
for a single-table scan.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#12 David Rowley
david.rowley@2ndquadrant.com
In reply to: Robert Haas (#8)
Re: Parallel Queries and PostGIS

On 30 March 2016 at 09:14, Robert Haas <robertmhaas@gmail.com> wrote:

> On Tue, Mar 29, 2016 at 3:48 PM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:
>
>>> I have no idea why the cost adjustments that you need are different
>>> for the scan case and the aggregate case. That does seem problematic,
>>> but I just don't know why it's happening.
>>
>> What might be a good way to debug it? Is there a piece of code I can
>> look at to try and figure out the contribution of COST in either case?
>
> Well, the cost calculations are mostly in costsize.c, but I dunno how
> much that helps. Maybe it would help if you posted some EXPLAIN
> ANALYZE output for the different cases, with and without parallelism?
>
> One thing I noticed about this output (from your blog)...
>
> Finalize Aggregate
> (cost=16536.53..16536.79 rows=1 width=8)
> (actual time=2263.638..2263.639 rows=1 loops=1)
> -> Gather
> (cost=16461.22..16461.53 rows=3 width=32)
> (actual time=754.309..757.204 rows=4 loops=1)
> Number of Workers: 3
> -> Partial Aggregate
> (cost=15461.22..15461.23 rows=1 width=32)
> (actual time=676.738..676.739 rows=1 loops=4)
> -> Parallel Seq Scan on pd
> (cost=0.00..13856.38 rows=64 width=2311)
> (actual time=3.009..27.321 rows=42 loops=4)
> Filter: (fed_num = 47005)
> Rows Removed by Filter: 17341
> Planning time: 0.219 ms
> Execution time: 2264.684 ms
>
> ...is that the finalize aggregate phase is estimated to be very cheap,
> but it's actually wicked expensive. We get the results from the
> workers in only 750 ms, but it takes another second and a half to
> aggregate those 4 rows???

Hmm, actually I've just realised that create_grouping_paths() should
be accounting for agg_costs differently depending on whether it's
partial aggregation, finalize aggregation, or plain aggregation.
count_agg_clauses() needs to be passed the aggregate type information
to allow the walker function to cost the correct portions of the
aggregate based on what type of aggregation the costs will be used for.
In short, please don't bother to spend too much time tuning your costs
until I fix this.

As of now the Partial Aggregate is including the cost of the final
function... that's certainly broken, as it does not call that
function.

I will try to get something together over the weekend to fix this, but
I have other work to do until then.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#13 David Rowley
david.rowley@2ndquadrant.com
In reply to: David Rowley (#12)
Re: Parallel Queries and PostGIS

On 1 April 2016 at 17:12, David Rowley <david.rowley@2ndquadrant.com> wrote:

> On 30 March 2016 at 09:14, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> On Tue, Mar 29, 2016 at 3:48 PM, Paul Ramsey <pramsey@cleverelephant.ca> wrote:
>>
>>>> I have no idea why the cost adjustments that you need are different
>>>> for the scan case and the aggregate case. That does seem problematic,
>>>> but I just don't know why it's happening.
>>>
>>> What might be a good way to debug it? Is there a piece of code I can
>>> look at to try and figure out the contribution of COST in either case?
>>
>> Well, the cost calculations are mostly in costsize.c, but I dunno how
>> much that helps. Maybe it would help if you posted some EXPLAIN
>> ANALYZE output for the different cases, with and without parallelism?
>>
>> One thing I noticed about this output (from your blog)...
>>
>> Finalize Aggregate
>> (cost=16536.53..16536.79 rows=1 width=8)
>> (actual time=2263.638..2263.639 rows=1 loops=1)
>> -> Gather
>> (cost=16461.22..16461.53 rows=3 width=32)
>> (actual time=754.309..757.204 rows=4 loops=1)
>> Number of Workers: 3
>> -> Partial Aggregate
>> (cost=15461.22..15461.23 rows=1 width=32)
>> (actual time=676.738..676.739 rows=1 loops=4)
>> -> Parallel Seq Scan on pd
>> (cost=0.00..13856.38 rows=64 width=2311)
>> (actual time=3.009..27.321 rows=42 loops=4)
>> Filter: (fed_num = 47005)
>> Rows Removed by Filter: 17341
>> Planning time: 0.219 ms
>> Execution time: 2264.684 ms
>>
>> ...is that the finalize aggregate phase is estimated to be very cheap,
>> but it's actually wicked expensive. We get the results from the
>> workers in only 750 ms, but it takes another second and a half to
>> aggregate those 4 rows???
>
> hmm, actually I've just realised that create_grouping_paths() should
> be accounting agg_costs differently depending if it's partial
> aggregation, finalize aggregation, or just normal. count_agg_clauses()
> needs to be passed the aggregate type information to allow the walker
> function to cost the correct portions of the aggregate correctly based
> on what type of aggregation the costs will be used for. In short,
> please don't bother to spend too much time tuning your costs until I
> fix this.
>
> As of now the Partial Aggregate is including the cost of the final
> function... that's certainly broken, as it does not call that
> function.
>
> I will try to get something together over the weekend to fix this, but
> I have other work to do until then.

Hi Paul,

As of deb71fa, committed by Robert today, you should have a bit more
control over parallel aggregate costings. You can now raise the
transfn cost, or drop the combinefn cost, to encourage parallel
aggregation. Keep in mind that the serialfn and deserialfn costs are
now accounted for too.
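
With placeholder function names, something along these lines should now
nudge the planner towards the parallel aggregate plan:

  -- A costly transition function plus a cheap combine function makes
  -- partial (parallel) aggregation look comparatively better.
  ALTER FUNCTION union_transfn(internal, geometry) COST 1000;
  ALTER FUNCTION union_combinefn(internal, internal) COST 1;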

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#14 Stephen Frost
sfrost@snowman.net
In reply to: Paul Ramsey (#3)
Re: Parallel Queries and PostGIS

Paul,

* Paul Ramsey (pramsey@cleverelephant.ca) wrote:

> On Mon, Mar 28, 2016 at 9:45 AM, Stephen Frost <sfrost@snowman.net> wrote:
>
>> Would you agree that it'd be helpful to have for making the st_union()
>> work better in parallel?
>
> For our particular situation w/ ST_Union, yes, it would be ideal to be
> able to run a worker-side combine function as well as the master-side
> one. Although the cascaded union would be less effective spread out
> over N nodes, doing it only once per worker, rather than every N
> records would minimize the loss of effectiveness.

I chatted with Robert a bit about this and he had an interesting
suggestion. I'm not sure that it would work for you, but the
serialize/deserialize functions are used to transfer the results from
the worker process to the main process. You could possibly do the
per-worker finalize work in the serialize function to get the benefit of
running that in parallel.

You'll need to mark the aggtranstype as 'internal' to have the
serialize/deserialize code called. Hopefully that's not too much of an
issue.
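
For comparison, you can see which existing aggregates already take the
internal-transtype-plus-serialize route with a catalog query along
these lines:

  SELECT aggfnoid::oid::regprocedure AS aggregate,
         aggtranstype::regtype       AS transtype,
         aggserialfn, aggdeserialfn
    FROM pg_aggregate
   WHERE aggserialfn::oid <> 0;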

Thanks!

Stephen

#15 Paul Ramsey
pramsey@cleverelephant.ca
In reply to: Stephen Frost (#14)
Re: Parallel Queries and PostGIS

On Fri, Apr 22, 2016 at 11:44 AM, Stephen Frost <sfrost@snowman.net> wrote:

> Paul,
>
> * Paul Ramsey (pramsey@cleverelephant.ca) wrote:
>
>> On Mon, Mar 28, 2016 at 9:45 AM, Stephen Frost <sfrost@snowman.net> wrote:
>>
>>> Would you agree that it'd be helpful to have for making the st_union()
>>> work better in parallel?
>>
>> For our particular situation w/ ST_Union, yes, it would be ideal to be
>> able to run a worker-side combine function as well as the master-side
>> one. Although the cascaded union would be less effective spread out
>> over N nodes, doing it only once per worker, rather than every N
>> records would minimize the loss of effectiveness.
>
> I chatted with Robert a bit about this and he had an interesting
> suggestion. I'm not sure that it would work for you, but the
> serialize/deserialize functions are used to transfer the results from
> the worker process to the main process. You could possibly do the
> per-worker finalize work in the serialize function to get the benefit of
> running that in parallel.
>
> You'll need to mark the aggtranstype as 'internal' to have the
> serialize/deserialize code called. Hopefully that's not too much of an
> issue.

Thanks, Stephen. We were actually thinking that it might make more
sense to just do the parallel processing in our own threads in the
finalfunc. Not as elegant and magical as bolting into the PgSQL infra,
but if we're doing something hacky anyway, it might as well be our own
hacky.

ATB,
P

> Thanks!
>
> Stephen
