using custom scan nodes to prototype parallel sequential scan
On Wed, Oct 15, 2014 at 2:55 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Something usable, with severe restrictions, is actually better than we
have now. I understand the journey this work represents, so don't be
embarrassed by submitting things with heuristics and good-enoughs in
it. Our mentor, Mr. Lane, achieved much by spreading work over many
releases, leaving others to join in the task.
It occurs to me that, now that the custom-scan stuff is committed, it
wouldn't be that hard to use that, plus the other infrastructure we
already have, to write a prototype of parallel sequential scan. Given
where we are with the infrastructure, there would be a number of
unhandled problems, such as deadlock detection (needs group locking or
similar), assessment of quals as to parallel-safety (needs
proisparallel or similar), general waterproofing to make sure that
pushing down a qual we shouldn't doesn't do anything really dastardly
like crash the server (another written but yet-to-be-published patch
adds a bunch of relevant guards), and snapshot sharing (likewise).
But if you don't do anything weird, it should basically work.
I think this would be useful for a couple of reasons. First, it would
be a demonstrable show of progress, illustrating how close we are to
actually having something you can really deploy. Second, we could use
it to demonstrate how the remaining infrastructure patches close up
gaps in the initial prototype. Third, it would let us start doing
real performance testing. It seems pretty clear that a parallel
sequential scan of data that's in memory (whether the page cache or
the OS cache) can be accelerated by having multiple processes scan it
in parallel. But it's much less clear what will happen when the data
is being read in from disk. Does parallelism help at all? What
degree of parallelism helps? Do we break OS readahead so badly that
performance actually regresses? These are things that are likely to
need a fair amount of tuning before this is ready for prime time, so
being able to start experimenting with them in advance of all of the
infrastructure being completely ready seems like it might help.
Thoughts?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Robert, All,
On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
It occurs to me that, now that the custom-scan stuff is committed, it
wouldn't be that hard to use that, plus the other infrastructure we
already have, to write a prototype of parallel sequential scan. Given
where we are with the infrastructure, there would be a number of
unhandled problems, such as deadlock detection (needs group locking or
similar), assessment of quals as to parallel-safety (needs
proisparallel or similar), general waterproofing to make sure that
pushing down a qual we shouldn't doesn't do anything really dastardly
like crash the server (another written but yet-to-be-published patch
adds a bunch of relevant guards), and snapshot sharing (likewise).
But if you don't do anything weird, it should basically work.

I think this would be useful for a couple of reasons. First, it would
be a demonstrable show of progress, illustrating how close we are to
actually having something you can really deploy. Second, we could use
it to demonstrate how the remaining infrastructure patches close up
gaps in the initial prototype. Third, it would let us start doing
real performance testing.
I think it might be a useful experiment - as long as it's clear that
it's that. Which is, I think, what you're thinking about?
It seems pretty clear that a parallel sequential scan of data that's
in memory (whether the page cache or the OS cache) can be accelerated
by having multiple processes scan it in parallel.
Right.
But it's much less clear what will happen when the data is being read
in from disk.
I think that *very* heavily depends on the IO subsystem.
Does parallelism help at all?
I'm pretty damn sure. We can't even make a mildly powerful storage
system fully busy right now. Heck, I can't make my workstation's storage, a
RAID 10 out of four spinning disks, fully busy.
I think some of that benefit also could be reaped by being better at
hinting the OS...
What degree of parallelism helps?
That's quite a hard question. Generally, the question of how much
parallelism is beneficial for which workload will be one of the most
complicated areas once the plumbing is in.
Do we break OS readahead so badly that performance actually regresses?
I don't think it's likely that we break OS readahead - that works on a
per-task basis, at least on Linux, AFAIK. But it's nonetheless very easy
to have too many streams causing too many random reads.
These are things that are likely to
need a fair amount of tuning before this is ready for prime time, so
being able to start experimenting with them in advance of all of the
infrastructure being completely ready seems like it might help.
I'm not actually entirely sure how much that's going to help. I think
you could very well have a WIP patch ready reasonably quickly that doesn't
solve the issues you mention above by patching them in. For the kind of
testing we're talking about that seems likely sufficient - a git branch
somewhere is probably actually easier for people to compile than some
contrib module that needs to be loaded...
And I *do* think that you'll very quickly hit the limits of the custom
scan API. And I'd much rather see you work on improving parallelism than
the custom scan stuff, just so you can prototype further ahead.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
Does parallelism help at all?
I'm pretty damn sure. We can't even make a mildly powerful storage
system fully busy right now. Heck, I can't make my workstation's storage, a
RAID 10 out of four spinning disks, fully busy.

I think some of that benefit also could be reaped by being better at
hinting the OS...
Yes, it definitely helps, and not only for I/O-bound operations.
It gives a good gain for queries with CPU-intensive WHERE conditions.

One more point we may need to consider: is there any overhead in passing
data rows from the workers to the backend? I feel this may play a major role
when the selectivity is high.
Regards,
Hari Babu
Fujitsu Australia
On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:
Yes, it definitely helps, and not only for I/O-bound operations.
It gives a good gain for queries with CPU-intensive WHERE conditions.

One more point we may need to consider: is there any overhead in passing
data rows from the workers to the backend?
I am not sure that overhead will be very visible if we improve the
use of the I/O subsystem by having parallel tasks work on it. However,
another idea here could be that instead of passing tuple data, we just
pass the tuple id; but in that case we have to retain the pin on the buffer
that contains the tuple until the master backend reads from it, and that
might have its own kind of problems.
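
To make that concrete, here is a minimal sketch of what such a shared-memory
message might look like; WorkerTupleRef and its layout are purely
illustrative, not anything that exists in the tree:

    /*
     * Hypothetical message for the pass-the-tuple-id idea: the worker sends
     * only the tuple's location, and must keep the containing buffer pinned
     * until the master backend has fetched the tuple from it.
     */
    #include "postgres.h"
    #include "storage/buf.h"
    #include "storage/itemptr.h"

    typedef struct WorkerTupleRef
    {
        ItemPointerData tid;    /* block number + line pointer of the tuple */
        Buffer          buf;    /* buffer the worker keeps pinned meanwhile */
    } WorkerTupleRef;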
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Nov 11, 2014 at 2:35 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
One more point we may need to consider: is there any overhead in passing
data rows from the workers to the backend?

I am not sure that overhead will be very visible if we improve the
use of the I/O subsystem by having parallel tasks work on it.
I feel there may be an overhead because the workers need to put the result
data in shared memory and the backend has to read it from there to process
it further. If the cost of transferring data from worker to backend is more
than fetching a tuple from the scan, then the overhead is visible when the
selectivity is high.
However,
another idea here could be that instead of passing tuple data, we just
pass the tuple id; but in that case we have to retain the pin on the buffer
that contains the tuple until the master backend reads from it, and that
might have its own kind of problems.
Transferring the tuple id doesn't cover the scenarios where the node needs
any projection.
Regards,
Hari Babu
Fujitsu Australia
On Tue, Nov 11, 2014 at 9:42 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
wrote:
Transferring the tuple id doesn't cover the scenarios where the node needs
any projection.
Hmm, that's why I said that we need to retain the buffer pin, so that we
can still get the tuple data.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tuesday, November 11, 2014 1:13 PM, Haribabu Kommi wrote:

I feel there may be an overhead because the workers need to put the result
data in shared memory and the backend has to read it from there to process
it further. If the cost of transferring data from worker to backend is more
than fetching a tuple from the scan, then the overhead is visible when the
selectivity is high.
In my experience, the data copy and transformation to fit the TupleTableSlot
is the biggest overhead, rather than the scan or join itself...
Probably, a straightforward way is to construct the values/isnull arrays
on a shared memory segment; the backend process then just switches the
tts_values/tts_isnull pointers, with no data copy. It gave us a performance
gain.
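
As a minimal sketch of that pointer switch - assuming the worker has already
materialized the values/isnull arrays into a shared memory segment the leader
can address directly, and ignoring that a real implementation would have to
restore the slot's own arrays before the segment goes away - it could look
like:

    #include "postgres.h"
    #include "executor/tuptable.h"

    static TupleTableSlot *
    return_shared_virtual_tuple(TupleTableSlot *slot,
                                Datum *shared_values, bool *shared_isnull)
    {
        ExecClearTuple(slot);

        /* No per-datum copy: just point the slot at the shared arrays. */
        slot->tts_values = shared_values;
        slot->tts_isnull = shared_isnull;

        return ExecStoreVirtualTuple(slot);
    }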
Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On 10 November 2014 15:57, Robert Haas <robertmhaas@gmail.com> wrote:
It occurs to me that, now that the custom-scan stuff is committed, it
wouldn't be that hard to use that, plus the other infrastructure we
already have, to write a prototype of parallel sequential scan.
+1
Given
where we are with the infrastructure, there would be a number of
unhandled problems, such as deadlock detection (needs group locking or
similar), assessment of quals as to parallel-safety (needs
proisparallel or similar), general waterproofing to make sure that
pushing down a qual we shouldn't doesn't do anything really dastardly
like crash the server (another written but yet-to-be-published patch
adds a bunch of relevant guards), and snapshot sharing (likewise).
But if you don't do anything weird, it should basically work.
If we build something fairly restricted but production-ready, we can
still do something useful for users.
* only functions marked as "CONTAINS NO SQL"
* parallel_workers = 2 or fixed
* only one Plan type, designed to maximise benefit and minimize optimizer change
Why?
* The above is all that is needed to test any infrastructure changes
we accept. We need both infrastructure and tests. We could do this as
a custom scan plugin, but why not make it work in core - that doesn't
prevent a plugin version also if you desire it.
* only functions marked as "CONTAINS NO SQL"
We don't really know what proisparallel is, but we do know what
CONTAINS NO SQL means and can easily check for it.
Plus I already have a patch for this, slightly bitrotted.
* parallel_workers = 2 (or at least not make it user settable)
By fixing the number of workers at 2 we avoid any problems caused by
having N variable, such as how to vary N fairly amongst users and
other such considerations. We get the main benefit of parallelism,
without causing other issues across the server.
* Fixed Plan: aggregate-scan
To make everything simpler, allow only plans of a single type.
SELECT something, list of aggregates
FROM foo
WHERE filters
GROUP BY something
We know that passing large amounts of data from worker to
master process will be slow, so focusing only on a seq scan is not
sensible; we should focus on plans that significantly reduce the
number of rows passed upwards. We could just do this for very
selective WHERE clauses, but that is not an important class of query.
As soon as we include aggregates, we reduce data passing significantly
AND we hit a very important subset of queries:
This plan type is widely used in reporting queries, so will hit the
mainline of BI applications and many Mat View creations.
This will allow SELECT count(*) FROM foo to go faster also.
The execution plan for that query type looks like this...
Hash Aggregate
Gather From Workers
{Worker Nodes workers = 2
HashAggregate
PartialScan}
This is simple enough that we need only add a mechanism for the Gather
operation, and simple enough not to need extensive optimizer
changes - just changes to set_plain_rel_pathlist() to consider
parallel plans.
Plan costs are easy to calculate for the above...

cpu_worker_startup_cost = 100 -- the startup cost is easy to calculate by
observation, but a reasonably large default will be OK
cpu_ipc_tuple_cost = 0.1 -- assume 10x the normal cpu_tuple_cost

Partial scan costs are just the same as for a SeqScan, only with fewer
blocks. All other costs are the same.
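
As a sketch only, with the parameters above (cpu_worker_startup_cost and
cpu_ipc_tuple_cost are the proposed settings, not existing GUCs;
seq_page_cost and cpu_tuple_cost defaults are the stock ones), the arithmetic
would be something like:

    static void
    cost_parallel_agg_scan_sketch(double nblocks, double ntuples,
                                  double nrows_returned, int nworkers,
                                  double *startup_cost, double *total_cost)
    {
        double  cpu_worker_startup_cost = 100.0;    /* proposed default */
        double  cpu_ipc_tuple_cost = 0.1;           /* ~10x cpu_tuple_cost */
        double  seq_page_cost = 1.0;                /* existing GUC default */
        double  cpu_tuple_cost = 0.01;              /* existing GUC default */

        /* Pay the startup cost once per worker launched. */
        *startup_cost = cpu_worker_startup_cost * nworkers;

        /*
         * Each partial scan costs the same as a SeqScan over its share of
         * the blocks and tuples; every row passed up from a worker to the
         * master then pays the IPC cost.
         */
        *total_cost = *startup_cost
            + seq_page_cost * (nblocks / nworkers)
            + cpu_tuple_cost * (ntuples / nworkers)
            + cpu_ipc_tuple_cost * nrows_returned;
    }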
We can submit the main patch by Dec 15, fix all the problems by Feb 15.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
* only functions marked as "CONTAINS NO SQL"
We don't really know what proisparallel is, but we do know what CONTAINS
NO SQL means and can easily check for it.
Plus I already have a patch for this, slightly bitrotted.
Isn't provolatile = PROVOLATILE_IMMUTABLE sufficient?
* Fixed Plan: aggregate-scan
To make everything simpler, allow only plans of a single type.

SELECT something, list of aggregates
FROM foo
WHERE filters
GROUP BY something

As soon as we include aggregates, we reduce data passing significantly
AND we hit a very important subset of queries: this plan type is widely
used in reporting queries, so will hit the mainline of BI applications
and many Mat View creations.
This will allow SELECT count(*) FROM foo to go faster also.
I agree with you. The key to performance is how many rows can be reduced
by the parallel processors; however, one thing I worry about is that the
condition for kicking off parallel execution may be too restrictive.
I'm not sure how many workloads run aggregate functions against flat tables
rather than against multiple joined relations.

Sorry, it might be a topic to discuss in a separate thread.
Can't we have functionality to push down aggregate functions across table
joins? If we could do this in some cases, even if not in all cases, it would
make sense to run typical BI workloads in parallel.
Let me assume the following query:

SELECT SUM(t1.X), SUM(t2.Y) FROM t1 JOIN t2 ON t1.pk_id = t2.fk_id GROUP BY t1.A, t2.B;

It is a usual query that joins two relations on a PK and FK.
It can be broken down like:

SELECT SUM(t1.X), SUM(sq2.Y)
FROM t1 JOIN (SELECT t2.fk_id, t2.B, SUM(t2.Y) Y FROM t2 GROUP BY t2.fk_id, t2.B) sq2
  ON t1.pk_id = sq2.fk_id
GROUP BY t1.A, sq2.B;
If the FK side has 100 tuples per PK on average, an aggregate that runs prior
to the join will massively reduce the number of tuples to be joined, and we
can leverage parallel processors to do the preliminary aggregation on the
flat table.
I'm now reviewing the earlier academic work to identify the conditions under
which a join allows aggregate functions to be pushed down.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.7751
Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On Tue, Nov 11, 2014 at 3:29 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
* only functions marked as "CONTAINS NO SQL"
We don't really know what proisparallel is, but we do know what
CONTAINS NO SQL means and can easily check for it.
Plus I already have a patch for this, slightly bitrotted.
Interestingly, I have a fairly solid idea of what proisparallel is,
but I have no clear idea what CONTAINS NO SQL is or why it's relevant.
I would imagine that srandom() contains no SQL under any reasonable
definition of what that means, but it ain't parallel-safe.
* parallel_workers = 2 (or at least not make it user settable)
By fixing the number of workers at 2 we avoid any problems caused by
having N variable, such as how to vary N fairly amongst users and
other such considerations. We get the main benefit of parallelism,
without causing other issues across the server.
I think this is a fairly pointless restriction. The code
simplification we'll get out of it appears to me to be quite minor,
and we'll just end up putting the stuff back in anyway.
* Fixed Plan: aggregate-scan
To make everything simpler, allow only plans of a single type.
SELECT something, list of aggregates
FROM foo
WHERE filters
GROUP BY something
because we know that passing large amounts of data from worker to
master process will be slow, so focusing only on seq scan is not
sensible; we should focus on plans that significantly reduce the
number of rows passed upwards. We could just do this for very
selective WHERE clauses, but that is not an important class of query.
As soon as include aggregates, we reduce data passing significantly
AND we hit a very important subset of queries:
This is moving the goalposts in a way that I'm not at all comfortable
with. Parallel sequential scan is pretty simple and may well be a win
if there's a restrictive filter condition involved. Parallel
aggregation requires introducing new infrastructure into the aggregate
machinery to allow intermediate state values to be combined, and that
would be a great project for someone to do at some time, but it seems
like a distraction for me to do that right now.
This plan type is widely used in reporting queries, so will hit the
mainline of BI applications and many Mat View creations.
This will allow SELECT count(*) FROM foo to go faster also.

The execution plan for that query type looks like this...
Hash Aggregate
Gather From Workers
{Worker Nodes workers = 2
HashAggregate
PartialScan}
I'm going to aim for the simpler:
Hash Aggregate
-> Parallel Seq Scan
Workers: 4
Yeah, I know that won't perform as well as what you're proposing, but
I'm fairly sure it's simpler.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Nov 11, 2014 at 9:29 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
This plan type is widely used in reporting queries, so will hit the
mainline of BI applications and many Mat View creations.
This will allow SELECT count(*) FROM foo to go faster also.
We'd also need to add some infrastructure to merge aggregate states
together for this to work properly. This means it could also work for
avg() and stddev(), etc. For max() and min(), the merge functions would
likely just be the same as the transition functions.
Regards
David Rowley
On Wed, Nov 12, 2014 at 1:24 PM, David Rowley <dgrowleyml@gmail.com> wrote:
We'd also need to add some infrastructure to merge aggregate states
together for this to work properly. This means it could also work for
avg() and stddev(), etc. For max() and min(), the merge functions would
likely just be the same as the transition functions.
It might make sense to make a new planner operator which can be responsible
for pulling from each of the individual parallel Agg nodes and then
aggregating over the results.

A couple of things that might be worth considering are whether we want to
enforce parallel aggregation or let the planner decide whether to do
a parallel aggregate or go with a single plan. For example, the average
estimated size of the groups might be one thing the planner considers when
deciding between a parallel and a single execution plan.

I don't see merging states as an easy problem; it should perhaps be tackled
apart from this thread.

Also, do we want to allow parallelism only with GroupAggs?
Regards,
Atri
On Tue, Nov 11, 2014 at 7:48 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
Isn't provolatile = PROVOLATILE_IMMUTABLE sufficient?
There are certainly things that are parallel-safe that are not
immutable. It might be the case that everything immutable is
parallel-safe.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 12, 2014 at 9:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Nov 11, 2014 at 7:48 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
Isn't provolatile = PROVOLATILE_IMMUTABLE sufficient?
There are certainly things that are parallel-safe that are not
immutable. It might be the case that everything immutable is
parallel-safe.
FWIW, when working on the concept of expression and clause
shippability for Postgres-XC (aka the possibility to pass an expression
safely to another backend - in that case, one in another PG node), we
discussed similar things, and if I recall correctly we even discussed
adding a flag to pg_proc to define whether a function was shippable or not.
What we finally settled on was not adding a new flag, and using as a
rule that all immutable functions can be safely shipped and
others not, even though some stable functions *could* be safe. Maybe
Ashutosh has more comments on that; my memory may be failing.
In the end, I think that you would finish with something similar.
My 2c.
--
Michael
On 12 November 2014 00:54, Robert Haas <robertmhaas@gmail.com> wrote:
I'm going to aim for the simpler:
Hash Aggregate
-> Parallel Seq Scan
Workers: 4

Yeah, I know that won't perform as well as what you're proposing, but
I'm fairly sure it's simpler.
Simple is best, so +1.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 12 November 2014 07:54, David Rowley <dgrowleyml@gmail.com> wrote:
We'd also need to add some infrastructure to merge aggregate states
together for this to work properly. This means it could also work for
avg() and stddev(), etc. For max() and min(), the merge functions would
likely just be the same as the transition functions.
Do you mean something like a "subtotal" or "intermediate combination" function?
I guess we'd need the same thing to make intermediate aggregates work
during a sort?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 12 November 2014 00:54, Robert Haas <robertmhaas@gmail.com> wrote:
Interestingly, I have a fairly solid idea of what proisparallel is,
but I have no clear idea what CONTAINS NO SQL is or why it's relevant.
I would imagine that srandom() contains no SQL under any reasonable
definition of what that means, but it ain't parallel-safe.
What is wrong with generating random numbers in parallel?

But I'm sure many volatile functions would be annoying to support, so
CONTAINS NO SQL plus STABLE/IMMUTABLE seems OK as a first step.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Nov 14, 2014 at 1:19 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
Do you mean something like a "subtotal" or "intermediate combination"
function?
If you had 4 parallel workers performing a seq scan, say the relation was
4000 pages in total, you could say that worker 1 would scan blocks 0-999,
worker 2 blocks 1000-1999, etc. After each worker had finished, there would
then be 4 sets of records that needed to be merged into 1 set.

Take int4_avg_accum() for example; it does:

transdata->count++;
transdata->sum += newval;

The merge function would need to perform something like:

transdata->count += transdata2merge.count;
transdata->sum += transdata2merge.sum;

Then the final function could be called on the merged aggregate state.
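
A sketch of what such a merge function might look like; the struct mirrors
the Int8TransTypeData transition state that int4_avg_accum really uses, while
int4_avg_merge is a hypothetical name, not an existing function:

    #include <stdint.h>

    typedef struct Int8TransTypeData
    {
        int64_t count;          /* number of input values accumulated */
        int64_t sum;            /* running sum of the input values */
    } Int8TransTypeData;

    static void
    int4_avg_merge(Int8TransTypeData *state, const Int8TransTypeData *other)
    {
        /* Merging two partial states is just component-wise addition. */
        state->count += other->count;
        state->sum += other->sum;
    }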
The same can be applied when the query contains a GROUP BY clause; we'd
just need to pay attention to which groups we merge together for that to
work. Any HAVING clause would have to be applied after the groups have
been merged.
This whole topic is pretty exciting for data warehouse type workloads.
Regards
David Rowley
If you had 4 parallel workers performing a seq scan, say the relation was
4000 pages in total, you could say that worker 1 would scan blocks 0-999,
worker 2 blocks 1000-1999, etc. After each worker had finished, there would
then be 4 sets of records that needed to be merged into 1 set.

Take int4_avg_accum() for example; it does:

transdata->count++;
transdata->sum += newval;

The merge function would need to perform something like:

transdata->count += transdata2merge.count;
transdata->sum += transdata2merge.sum;

Then the final function could be called on the merged aggregate state.
The same can be applied when the query contains a GROUP BY clause; we'd
just need to pay attention to which groups we merge together for that to
work. Any HAVING clause would have to be applied after the groups have
been merged.

This whole topic is pretty exciting for data warehouse type workloads.
To simplify further, we can describe a parallel-aware aggregate function.
Please assume an AVG(X) function that takes nrows and the sum of X. Its
transition function performs as you described above, then the final function
works as usual.

The jobs the parallel seq scan needs to do are:
1. replace AVG(X) by AVG(nrows, sum(X))
2. generate count(*) on the partial relation being grouped.
3. generate sum(X) on the partial relation being grouped.

It looks like the following query:

SELECT cat, AVG(nrows, sum_X) FROM (
  SELECT cat, count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno BETWEEN 0 AND 999 GROUP BY cat
  UNION ALL
  SELECT cat, count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno BETWEEN 1000 AND 1999 GROUP BY cat
  UNION ALL
  SELECT cat, count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno BETWEEN 2000 AND 2999 GROUP BY cat
  UNION ALL
  SELECT cat, count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno BETWEEN 3000 AND 3999 GROUP BY cat
) sq GROUP BY cat;

Note that the stuff inside the sub-query is the job of the parallel scan
(UNION ALL rather than UNION, so that duplicate partial rows are not
collapsed).
Do we need to invent any new infrastructure here?
Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On Fri, Nov 14, 2014 at 2:12 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
To simplify further, we can describe a parallel-aware aggregate function.
Please assume an AVG(X) function that takes nrows and the sum of X. Its
transition function performs as you described above, then the final function
works as usual.

The jobs the parallel seq scan needs to do are:
1. replace AVG(X) by AVG(nrows, sum(X))
2. generate count(*) on the partial relation being grouped.
3. generate sum(X) on the partial relation being grouped.

It looks like the following query:

SELECT cat, AVG(nrows, sum_X) FROM (
  SELECT cat, count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno BETWEEN 0 AND 999 GROUP BY cat
  UNION ALL
  SELECT cat, count(*) nrows, sum(X) sum_X FROM tbl WHERE blkno BETWEEN 1000 AND 1999 GROUP BY cat
  UNION ALL ...
) sq GROUP BY cat;
Well, this would require giving the planner some kind of knowledge of what
AVG() is. It currently knows nothing about that. It currently calls the
transition function for each row, and the final function at the end, and
does not care or need to care about what either of those functions actually
does. The transformation above looks like it would need to care, and the
logic to add that would be way more complex than aggregate merge functions.

Likely for most aggregates, like count, sum, max, min, bit_and and bit_or,
the merge function would be the same as the transition function, as the
state type is just the same as the input type. It would only be aggregates
like avg(), stddev*(), bool_and() and bool_or() that would need a new merge
function made... These would be no more complex than the transition
functions... which are just a few lines of code anyway.
We'd simply just not run a parallel query if any aggregate used in the
query didn't have a merge function.
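
A sketch of that planner check, assuming a new pg_aggregate column (called
aggmergefn here, which does not exist today) recording the merge function's
OID:

    #include "postgres.h"
    #include "access/htup_details.h"
    #include "catalog/pg_aggregate.h"
    #include "utils/syscache.h"

    /*
     * Return true if the aggregate has a merge function, and so would be
     * safe to run in parallel under this scheme.
     */
    static bool
    aggregate_has_merge_function(Oid aggfnoid)
    {
        HeapTuple   tup;
        bool        result;

        tup = SearchSysCache1(AGGFNOID, ObjectIdGetDatum(aggfnoid));
        if (!HeapTupleIsValid(tup))
            elog(ERROR, "cache lookup failed for aggregate %u", aggfnoid);

        /* aggmergefn is the hypothetical new pg_aggregate column */
        result = OidIsValid(((Form_pg_aggregate) GETSTRUCT(tup))->aggmergefn);
        ReleaseSysCache(tup);

        return result;
    }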
When I mentioned this, I didn't mean to appear to be placing a roadblock. I
was just bringing to the table the information that COUNT(*) + COUNT(*)
works OK for merging COUNT(*)'s subtotals, but AVG(n) + AVG(n) does not.
Merge functions should be a simple patch to write. If I thought there was
going to be a use for them in this release, I'd put my hand up to put a
patch together.
Regards
David Rowley