Implementation of GROUPING SETS (T431: Extended grouping capabilities)

Started by Олег Царевabout 17 years ago34 messageshackers

zabivator@gmail.com

about 17 years ago

Hello all.
Please, approve my ideas for implementation.

Standart has feature T431: Extended grouping capabilities.
This feature i found in TODO-list:
http://wiki.postgresql.org/wiki/Todo -> SQL Commands -> TO DO

MS SQL 2005 partial support this feature:
http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
http://blogs.msdn.com/craigfr/archive/2007/09/21/aggregation-with-rollup.aspx

MS SQL 2008 support this feature:
http://blogs.msdn.com/craigfr/archive/2007/10/11/grouping-sets-in-sql-server-2008.aspx

Oracle support this feature:
http://www.compshack.com/sql/oracle-group-rollup

So, it's short notes about GROUPING SETS, but more complete
information have in a official documentation of MS SQL and Oracle
(copyright limited for send as attach).

First. GROUPG SETS.

select A,B,C,SUM(D) from table group by GROUPING SETS( (A,B,C), (A),
() ) - it's example of use grouping sets.
Semantic of this construction - make group by over source more, than
one group of column.
It's very wide key - A,B C. In result set of this example we can find
result set of select select A,B,C,SUM(D) from table group by A,B,C -
as subset. It's mind: "GROUP BY A,B,C" - subset of "GROUP BY GROUPING
SETS( (A,B,C), (A), () )
Two subset - is GROUP BY A B, and instead C column we look NULL.
Third subset - GROUP BY (), instead A,B,C - NULL, one row - it's name
"GRAND TOTAL". - calculate over all subset without grouping

Also have function "GROUPING" it's function say about null - "real
null" (from table) or generated by "GROUP BY GROUPING SETS"

My point: this feature can implement over GROUP BY and UNION ALL
We can make decomposition of "GROUP BY GROUPING SETS( (A,B,C),(A),()
)" to select A,B,C fron table GROUP BY A,B,C .UNION ALL select
A,B,NULL from table group BY A,B UNION ALL NUll,NUll,NULL from table
group by();

So, it's very simple, don't require modification of executor and
callibrate cost - only parser and semantic anylysis,
'
So, ROLLUP(A1,...An) is alias to "GROUP BY GROUPING SETS(
(A1,...,An),(A1,...,An-1),... (An-1,An),(An),() ),
CUBE - analogue.

If this idea it's good - i can write code base on old patch
http://archives.postgresql.org/pgsql-hackers/2008-10/msg00838.php or
from clean list (as you wish).

In future i know how to implement ROLLUP more optimal (executor
iterator) and use this ROLLUP for optimisation another GROUP BY,
GROUPING SETS.

Thanks.

Pavel Stehule

pavel.stehule@gmail.com

about 17 years ago

In reply to: Олег Царев (#1)

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

Hello

some other info is on http://wiki.postgresql.org/wiki/Grouping_Sets

Maybe in e few weak I'll have some a prototype based on CTE. My older
technique based on hashed tables should be well, but it carry lot of
unshared code. Using CTE means so we can optimize CTE and GROUPING
SETS together. I plan to transform query

SELECT * FROM some GROUP BY GROUPING SETS(a, b)

WITH q AS (SELECT * FROM some)
SELECT * FROM q GROUP BY a
UNION ALL
SELECT * FROM q GROUP BY b

2009/5/10 Олег Царев <zabivator@gmail.com>:

Hello all.
Please, approve my ideas for implementation.

Standart has feature T431: Extended grouping capabilities.
This feature i found in TODO-list:
http://wiki.postgresql.org/wiki/Todo -> SQL Commands -> TO DO

MS SQL 2005 partial support this feature:
http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
http://blogs.msdn.com/craigfr/archive/2007/09/21/aggregation-with-rollup.aspx

MS SQL 2008 support this feature:
http://blogs.msdn.com/craigfr/archive/2007/10/11/grouping-sets-in-sql-server-2008.aspx

Oracle support this feature:
http://www.compshack.com/sql/oracle-group-rollup

So, it's short notes about GROUPING SETS, but more complete
information have in a official documentation of MS SQL and Oracle
(copyright limited for send as attach).

First. GROUPG SETS.

select A,B,C,SUM(D) from table group by GROUPING SETS( (A,B,C), (A),
() ) - it's example of use grouping sets.
Semantic of this construction - make group by over source more, than
one group of column.
It's very wide key - A,B C. In result set of this example we can find
result set of select select A,B,C,SUM(D) from table group by A,B,C -
as subset. It's mind: "GROUP BY A,B,C" - subset of "GROUP BY GROUPING
SETS( (A,B,C), (A), () )
Two subset - is GROUP BY A B, and instead C column we look NULL.
Third subset - GROUP BY (), instead A,B,C - NULL, one row - it's name
"GRAND TOTAL". - calculate over all subset without grouping

Also have function "GROUPING" it's function say about null - "real
null" (from table) or generated by "GROUP BY GROUPING SETS"

My point: this feature can implement over GROUP BY and UNION ALL
We can make decomposition of "GROUP BY GROUPING SETS( (A,B,C),(A),()
)" to select A,B,C fron table GROUP BY A,B,C .UNION ALL select
A,B,NULL from table group BY A,B UNION ALL NUll,NUll,NULL from table
group by();

So, it's very simple, don't require modification of executor and
callibrate cost - only parser and semantic anylysis,
'
So, ROLLUP(A1,...An) is alias to "GROUP BY GROUPING SETS(
(A1,...,An),(A1,...,An-1),... (An-1,An),(An),() ),
CUBE - analogue.

If this idea it's good - i can write code base on old patch
http://archives.postgresql.org/pgsql-hackers/2008-10/msg00838.php or
from clean list (as you wish).

This is suboptimal. When SELECT * FROM X GROUP BY GROUPING SETS ...,
where X is some joined data, then you repeat JOIN on every grouping
set. So your solution is simple for implementation, but it should be
really slow.

Regards
Pavel Stehule

Show quoted text

In future i know how to implement ROLLUP more optimal (executor
iterator) and use this ROLLUP for optimisation another GROUP BY,
GROUPING SETS.

Thanks.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Олег Царев

zabivator@gmail.com

about 17 years ago

In reply to: Pavel Stehule (#2)

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

Hello, Pavel.

I read you letter, and found some points:

a) Simple transformation any GROUPING SETS in GROUP BY + UNION ALL
require clone source for every group.
It's so slow.
My point: we can make for start simple implementation.
b) Your sollution it's so good, IMHO
WITH q AS (SELECT * FROM some)
SELECT * FROM q GROUP BY a
UNION ALL
SELECT * FROM q GROUP BY b
But can you describe me how it's used on planner (calculate cost) and exector.
Sharing source - it's mind - we don't process tree, we process
DIRECTED GRAPH. Hard, many bugs, so on, not?
PostgreSQL support sharing nodes of executor? How? Where i can read?
Next. Ok, we spool source. After we use simple hash group/sort group +
union all? Or more advanced techniques? If different group by - it's
not different from my idea, it's logic additional (spool source for
fast rewind)

c) Don't forget also about node requiments - if we use hash table as
container for source - we reduce some sorting properties - and query
select A,B from TABLE group by ((A,B),(B)) order by a,b require
sorting on output of grouping sets.
It's mind - we need insert sort.

d) We can make SORT ROLLUP and SORT REDUCE ROLLUP.
first - sort group by use sorting for grouping data and calculate aggregations.
So, 'order by A,B,C' also 'order by A,B' also order by 'A' also order by ''
What it's mind? It's mind we can use sort features for implement SORT
ROLLUP. For every sub group we calculate self aggregates... oh, i bad
say. Look on "select a,sum(b) from table group by a". Sort group
grouping by a, and for every a-group calculate aggregations (sum), for
ROLLUP A,B we can make three aggregations ; for A,B than A than () -
and use one path on source calculate aggregations on every step, and
for split source on different subsets.
It's more perfomance, than different group by and union all.

Look for MS SQL:
http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
Why MS SQL don't support distinct aggregations? I think - because
every distinct aggregations in MS SQL require hash, and many
aggregations - it's so slow.

Thank you!
2009/5/10 Pavel Stehule <pavel.stehule@gmail.com>:

Show quoted text

Hello

some other info is on http://wiki.postgresql.org/wiki/Grouping_Sets

Maybe in e few weak I'll have some a prototype based on CTE. My older
technique based on hashed tables should be well, but it carry lot of
unshared code. Using CTE means so we can optimize CTE and GROUPING
SETS together. I plan to transform query

SELECT * FROM some GROUP BY GROUPING SETS(a, b)

to

WITH q AS (SELECT * FROM some)
SELECT * FROM q GROUP BY a
UNION ALL
SELECT * FROM q GROUP BY b

2009/5/10 Олег Царев <zabivator@gmail.com>:

Hello all.
Please, approve my ideas for implementation.

Standart has feature T431: Extended grouping capabilities.
This feature i found in TODO-list:
http://wiki.postgresql.org/wiki/Todo -> SQL Commands -> TO DO

MS SQL 2005 partial support this feature:
http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
http://blogs.msdn.com/craigfr/archive/2007/09/21/aggregation-with-rollup.aspx

MS SQL 2008 support this feature:
http://blogs.msdn.com/craigfr/archive/2007/10/11/grouping-sets-in-sql-server-2008.aspx

Oracle support this feature:
http://www.compshack.com/sql/oracle-group-rollup

So, it's short notes about GROUPING SETS, but more complete
information have in a official documentation of MS SQL and Oracle
(copyright limited for send as attach).

First. GROUPG SETS.

select A,B,C,SUM(D) from table group by GROUPING SETS( (A,B,C), (A),
() ) - it's example of use grouping sets.
Semantic of this construction - make group by over source more, than
one group of column.
It's very wide key - A,B C. In result set of this example we can find
result set of select select A,B,C,SUM(D) from table group by A,B,C -
as subset. It's mind: "GROUP BY A,B,C" - subset of "GROUP BY GROUPING
SETS( (A,B,C), (A), () )
Two subset - is GROUP BY A B, and instead C column we look NULL.
Third subset - GROUP BY (), instead A,B,C - NULL, one row - it's name
"GRAND TOTAL". - calculate over all subset without grouping

Also have function "GROUPING" it's function say about null - "real
null" (from table) or generated by "GROUP BY GROUPING SETS"

My point: this feature can implement over GROUP BY and UNION ALL
We can make decomposition of "GROUP BY GROUPING SETS( (A,B,C),(A),()
)" to select A,B,C fron table GROUP BY A,B,C .UNION ALL select
A,B,NULL from table group BY A,B UNION ALL NUll,NUll,NULL from table
group by();

So, it's very simple, don't require modification of executor and
callibrate cost - only parser and semantic anylysis,
'
So, ROLLUP(A1,...An) is alias to "GROUP BY GROUPING SETS(
(A1,...,An),(A1,...,An-1),... (An-1,An),(An),() ),
CUBE - analogue.

If this idea it's good - i can write code base on old patch
http://archives.postgresql.org/pgsql-hackers/2008-10/msg00838.php or
from clean list (as you wish).

This is suboptimal. When SELECT * FROM X GROUP BY GROUPING SETS ...,
where X is some joined data, then you repeat JOIN on every grouping
set. So your solution is simple for implementation, but it should be
really slow.

Regards
Pavel Stehule

In future i know how to implement ROLLUP more optimal (executor
iterator) and use this ROLLUP for optimisation another GROUP BY,
GROUPING SETS.

Thanks.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Pavel Stehule

pavel.stehule@gmail.com

about 17 years ago

In reply to: Олег Царев (#3)

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

2009/5/10 Олег Царев <zabivator@gmail.com>:

Hello, Pavel.

I read you letter, and found some points:

a) Simple transformation any GROUPING SETS in GROUP BY + UNION ALL
require clone source for every group.
It's so slow.
My point: we can make for start simple implementation.
b) Your sollution it's so good, IMHO
WITH q AS (SELECT * FROM some)
SELECT * FROM q GROUP BY a
UNION ALL
SELECT * FROM q GROUP BY b
But can you describe me how it's used on planner (calculate cost) and exector.

look on CTE source code or ask to CTE author.

Sharing source - it's mind - we don't process tree, we process
DIRECTED GRAPH. Hard, many bugs, so on, not?

for example - code for DISTINCT is shared with code for aggregation.
My idea is based on similarity

GROUPING SETS is subset of CTE, nonrecursive CTE is subset of UNION ALL

PostgreSQL support sharing nodes of executor? How? Where i can read?

I don't know, what do you thing exactly sharing nodes? Before CTE I
had to write special holder node, but it isn't necessary now.

Next. Ok, we spool source. After we use simple hash group/sort group +
union all? Or more advanced techniques? If different group by - it's
not different from my idea, it's logic additional (spool source for
fast rewind)

I thing so trivial implementation via UNION ALL should work, but any
optimisations means lot of work - and when you add rewind
functionality, you will got trivial nonrecursive CTE implementation.
When I wrote patch, there wasn't possible an use CTE, so I played wit
own nodes. Now, the more clean solution is probably based on current
CTE base.

c) Don't forget also about node requiments - if we use hash table as
container for source - we reduce some sorting properties - and query
select A,B from TABLE group by ((A,B),(B)) order by a,b require
sorting on output of grouping sets.
It's mind - we need insert sort.

Yes, probably there should be possible some techniques - that should
to eliminate final sort op. I am not sure if it is really important.
Now I thinking what is the bigger devil a) patch complexity b) some
suboptimality (mainly redundant call of agg nodes). For me a.
Aggregates are not gratis, but significantly more expensive is
repeated seq data scan. So it is first goal. Later we should to thing
about next optimalizations.

d) We can make SORT ROLLUP and SORT REDUCE ROLLUP.
first - sort group by use sorting for grouping data and calculate aggregations.
So, 'order by A,B,C' also 'order by A,B' also order by 'A' also order by ''
What it's mind? It's mind we can use sort features for implement SORT
ROLLUP. For every sub group we calculate self aggregates... oh, i bad
say. Look on "select a,sum(b) from table group by a". Sort group
grouping by a, and for every a-group calculate aggregations (sum), for
ROLLUP A,B we can make three aggregations ; for A,B than A than () -
and use one path on source calculate aggregations on every step, and
for split source on different subsets.
It's more perfomance, than different group by and union all.

Maybe, I don't know. What I know:

* UNION ALL will have better result on indexed sets and MIN, MAX
aggregates. CTE isn't optimized now for these cases.
* UNION ALL will be slow for large sets (over cache),
* CTE needs some more optimalizations,
* pg core hackers dislike big and complex patches,
* pg core hackers like any improved current code base.

I am thinking so Grouping Sets based on CTE should be more commitable
code. It doesn't mean so your ideas are wrong, but these
optimalization should to work on CTE too.

select * from table group by rollup(a,b,c)

have to have generate same plan as

with q as (select * from table)
select * from q group by a,b,c
union all
select * from q group by a,b
union all
select * from q group by a
union all
select * from q;

and CTE is more general then Grouping Sets, so it is better do
optimalization over CTE than Grouping Sets.

Regards
Pavel Stehule

Show quoted text

Look for MS SQL:
http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
Why MS SQL don't support distinct aggregations? I think - because
every distinct aggregations in MS SQL require hash, and many
aggregations - it's so slow.

Thank you!
2009/5/10 Pavel Stehule <pavel.stehule@gmail.com>:

Hello

some other info is on http://wiki.postgresql.org/wiki/Grouping_Sets

Maybe in e few weak I'll have some a prototype based on CTE. My older
technique based on hashed tables should be well, but it carry lot of
unshared code. Using CTE means so we can optimize CTE and GROUPING
SETS together. I plan to transform query

SELECT * FROM some GROUP BY GROUPING SETS(a, b)

to

WITH q AS (SELECT * FROM some)
SELECT * FROM q GROUP BY a
UNION ALL
SELECT * FROM q GROUP BY b

2009/5/10 Олег Царев <zabivator@gmail.com>:

Hello all.
Please, approve my ideas for implementation.

Standart has feature T431: Extended grouping capabilities.
This feature i found in TODO-list:
http://wiki.postgresql.org/wiki/Todo -> SQL Commands -> TO DO

MS SQL 2005 partial support this feature:
http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
http://blogs.msdn.com/craigfr/archive/2007/09/21/aggregation-with-rollup.aspx

MS SQL 2008 support this feature:
http://blogs.msdn.com/craigfr/archive/2007/10/11/grouping-sets-in-sql-server-2008.aspx

Oracle support this feature:
http://www.compshack.com/sql/oracle-group-rollup

So, it's short notes about GROUPING SETS, but more complete
information have in a official documentation of MS SQL and Oracle
(copyright limited for send as attach).

First. GROUPG SETS.

select A,B,C,SUM(D) from table group by GROUPING SETS( (A,B,C), (A),
() ) - it's example of use grouping sets.
Semantic of this construction - make group by over source more, than
one group of column.
It's very wide key - A,B C. In result set of this example we can find
result set of select select A,B,C,SUM(D) from table group by A,B,C -
as subset. It's mind: "GROUP BY A,B,C" - subset of "GROUP BY GROUPING
SETS( (A,B,C), (A), () )
Two subset - is GROUP BY A B, and instead C column we look NULL.
Third subset - GROUP BY (), instead A,B,C - NULL, one row - it's name
"GRAND TOTAL". - calculate over all subset without grouping

Also have function "GROUPING" it's function say about null - "real
null" (from table) or generated by "GROUP BY GROUPING SETS"

My point: this feature can implement over GROUP BY and UNION ALL
We can make decomposition of "GROUP BY GROUPING SETS( (A,B,C),(A),()
)" to select A,B,C fron table GROUP BY A,B,C .UNION ALL select
A,B,NULL from table group BY A,B UNION ALL NUll,NUll,NULL from table
group by();

So, it's very simple, don't require modification of executor and
callibrate cost - only parser and semantic anylysis,
'
So, ROLLUP(A1,...An) is alias to "GROUP BY GROUPING SETS(
(A1,...,An),(A1,...,An-1),... (An-1,An),(An),() ),
CUBE - analogue.

If this idea it's good - i can write code base on old patch
http://archives.postgresql.org/pgsql-hackers/2008-10/msg00838.php or
from clean list (as you wish).

This is suboptimal. When SELECT * FROM X GROUP BY GROUPING SETS ...,
where X is some joined data, then you repeat JOIN on every grouping
set. So your solution is simple for implementation, but it should be
really slow.

Regards
Pavel Stehule

In future i know how to implement ROLLUP more optimal (executor
iterator) and use this ROLLUP for optimisation another GROUP BY,
GROUPING SETS.

Thanks.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Олег Царев

zabivator@gmail.com

about 17 years ago

In reply to: Pavel Stehule (#4)

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

I will write separated mail about rollup.
Can you send me some links or information about "CTE"? What is it?
Also, I need some deep knownledge about postgresql aggregation
calculation (executor part) - can you help me (links, books, name of
source files, etc)?.
After get additional information i will can continue discussion.
Thanls
2009/5/10 Pavel Stehule <pavel.stehule@gmail.com>:

Show quoted text

2009/5/10 Олег Царев <zabivator@gmail.com>:

Hello, Pavel.

I read you letter, and found some points:

a) Simple transformation any GROUPING SETS in GROUP BY + UNION ALL
require clone source for every group.
It's so slow.
My point: we can make for start simple implementation.
b) Your sollution it's so good, IMHO
WITH q AS (SELECT * FROM some)
SELECT * FROM q GROUP BY a
UNION ALL
SELECT * FROM q GROUP BY b
But can you describe me how it's used on planner (calculate cost) and exector.

look on CTE source code or ask to CTE author.

Sharing source - it's mind - we don't process tree, we process
DIRECTED GRAPH. Hard, many bugs, so on, not?

for example - code for DISTINCT is shared with code for aggregation.
My idea is based on similarity

GROUPING SETS is subset of CTE, nonrecursive CTE is subset of UNION ALL

PostgreSQL support sharing nodes of executor? How? Where i can read?

I don't know, what do you thing exactly sharing nodes? Before CTE I
had to write special holder node, but it isn't necessary now.

Next. Ok, we spool source. After we use simple hash group/sort group +
union all? Or more advanced techniques? If different group by - it's
not different from my idea, it's logic additional (spool source for
fast rewind)

I thing so trivial implementation via UNION ALL should work, but any
optimisations means lot of work - and when you add rewind
functionality, you will got trivial nonrecursive CTE implementation.
When I wrote patch, there wasn't possible an use CTE, so I played wit
own nodes. Now, the more clean solution is probably based on current
CTE base.

c) Don't forget also about node requiments - if we use hash table as
container for source - we reduce some sorting properties - and query
select A,B from TABLE group by ((A,B),(B)) order by a,b require
sorting on output of grouping sets.
It's mind - we need insert sort.

Yes, probably there should be possible some techniques - that should
to eliminate final sort op. I am not sure if it is really important.
Now I thinking what is the bigger devil a) patch complexity b) some
suboptimality (mainly redundant call of agg nodes). For me a.
Aggregates are not gratis, but significantly more expensive is
repeated seq data scan. So it is first goal. Later we should to thing
about next optimalizations.

d) We can make SORT ROLLUP and SORT REDUCE ROLLUP.
first - sort group by use sorting for grouping data and calculate aggregations.
So, 'order by A,B,C' also 'order by A,B' also order by 'A' also order by ''
What it's mind? It's mind we can use sort features for implement SORT
ROLLUP. For every sub group we calculate self aggregates... oh, i bad
say. Look on "select a,sum(b) from table group by a". Sort group
grouping by a, and for every a-group calculate aggregations (sum), for
ROLLUP A,B we can make three aggregations ; for A,B than A than () -
and use one path on source calculate aggregations on every step, and
for split source on different subsets.
It's more perfomance, than different group by and union all.

Maybe, I don't know. What I know:

* UNION ALL will have better result on indexed sets and MIN, MAX
aggregates. CTE isn't optimized now for these cases.
* UNION ALL will be slow for large sets (over cache),
* CTE needs some more optimalizations,
* pg core hackers dislike big and complex patches,
* pg core hackers like any improved current code base.

I am thinking so Grouping Sets based on CTE should be more commitable
code. It doesn't mean so your ideas are wrong, but these
optimalization should to work on CTE too.

select * from table group by rollup(a,b,c)

have to have generate same plan as

with q as (select * from table)
select * from q group by a,b,c
union all
select * from q group by a,b
union all
select * from q group by a
union all
select * from q;

and CTE is more general then Grouping Sets, so it is better do
optimalization over CTE than Grouping Sets.

Regards
Pavel Stehule

Look for MS SQL:
http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
Why MS SQL don't support distinct aggregations? I think - because
every distinct aggregations in MS SQL require hash, and many
aggregations - it's so slow.

Thank you!
2009/5/10 Pavel Stehule <pavel.stehule@gmail.com>:

Hello

some other info is on http://wiki.postgresql.org/wiki/Grouping_Sets

Maybe in e few weak I'll have some a prototype based on CTE. My older
technique based on hashed tables should be well, but it carry lot of
unshared code. Using CTE means so we can optimize CTE and GROUPING
SETS together. I plan to transform query

SELECT * FROM some GROUP BY GROUPING SETS(a, b)

to

WITH q AS (SELECT * FROM some)
SELECT * FROM q GROUP BY a
UNION ALL
SELECT * FROM q GROUP BY b

2009/5/10 Олег Царев <zabivator@gmail.com>:

Hello all.
Please, approve my ideas for implementation.

Standart has feature T431: Extended grouping capabilities.
This feature i found in TODO-list:
http://wiki.postgresql.org/wiki/Todo -> SQL Commands -> TO DO

MS SQL 2005 partial support this feature:
http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
http://blogs.msdn.com/craigfr/archive/2007/09/21/aggregation-with-rollup.aspx

MS SQL 2008 support this feature:
http://blogs.msdn.com/craigfr/archive/2007/10/11/grouping-sets-in-sql-server-2008.aspx

Oracle support this feature:
http://www.compshack.com/sql/oracle-group-rollup

So, it's short notes about GROUPING SETS, but more complete
information have in a official documentation of MS SQL and Oracle
(copyright limited for send as attach).

First. GROUPG SETS.

select A,B,C,SUM(D) from table group by GROUPING SETS( (A,B,C), (A),
() ) - it's example of use grouping sets.
Semantic of this construction - make group by over source more, than
one group of column.
It's very wide key - A,B C. In result set of this example we can find
result set of select select A,B,C,SUM(D) from table group by A,B,C -
as subset. It's mind: "GROUP BY A,B,C" - subset of "GROUP BY GROUPING
SETS( (A,B,C), (A), () )
Two subset - is GROUP BY A B, and instead C column we look NULL.
Third subset - GROUP BY (), instead A,B,C - NULL, one row - it's name
"GRAND TOTAL". - calculate over all subset without grouping

Also have function "GROUPING" it's function say about null - "real
null" (from table) or generated by "GROUP BY GROUPING SETS"

My point: this feature can implement over GROUP BY and UNION ALL
We can make decomposition of "GROUP BY GROUPING SETS( (A,B,C),(A),()
)" to select A,B,C fron table GROUP BY A,B,C .UNION ALL select
A,B,NULL from table group BY A,B UNION ALL NUll,NUll,NULL from table
group by();

So, it's very simple, don't require modification of executor and
callibrate cost - only parser and semantic anylysis,
'
So, ROLLUP(A1,...An) is alias to "GROUP BY GROUPING SETS(
(A1,...,An),(A1,...,An-1),... (An-1,An),(An),() ),
CUBE - analogue.

If this idea it's good - i can write code base on old patch
http://archives.postgresql.org/pgsql-hackers/2008-10/msg00838.php or
from clean list (as you wish).

This is suboptimal. When SELECT * FROM X GROUP BY GROUPING SETS ...,
where X is some joined data, then you repeat JOIN on every grouping
set. So your solution is simple for implementation, but it should be
really slow.

Regards
Pavel Stehule

In future i know how to implement ROLLUP more optimal (executor
iterator) and use this ROLLUP for optimisation another GROUP BY,
GROUPING SETS.

Thanks.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Hitoshi Harada

umi.tanuki@gmail.com

about 17 years ago

In reply to: Pavel Stehule (#4)

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

2009/5/11 Pavel Stehule <pavel.stehule@gmail.com>:

I am thinking so Grouping Sets based on CTE should be more commitable
code. It doesn't mean so your ideas are wrong, but these
optimalization should to work on CTE too.

select * from table group by rollup(a,b,c)

have to have generate same plan as

with q as (select * from table)
select * from q group by a,b,c
union all
select * from q group by a,b
union all
select * from q group by a
union all
select * from q;

and CTE is more general then Grouping Sets, so it is better do
optimalization over CTE than Grouping Sets.

If you need to buffer tuples from the outer plan and to rescan it
multiple times, tuplestore seems more appropriate solution than using
CTE node, from semantic point of view. During CTE and window functions
development, tuplestore now has that kind of capability and CTE node
is only a wrapper of tuplestore.

Moreover, I guess you don't even need to buffer tuples to aggregate by
different keys. What you have to do is only to prepare more than one
hash tables (, or set up sort order if the plan detects hash table is
too large to fit in the memory), and one time seq scan will do. The
trans values are only to be stored in the memory, not the outer plan's
results. It will win greately in performance.

Regards,

--
Hitoshi Harada

Pavel Stehule

pavel.stehule@gmail.com

about 17 years ago

In reply to: Hitoshi Harada (#6)

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

2009/5/12 Hitoshi Harada <umi.tanuki@gmail.com>:

2009/5/11 Pavel Stehule <pavel.stehule@gmail.com>:

I am thinking so Grouping Sets based on CTE should be more commitable
code. It doesn't mean so your ideas are wrong, but these
optimalization should to work on CTE too.

select * from table group by rollup(a,b,c)

have to have generate same plan as

with q as (select * from table)
select * from q group by a,b,c
union all
select * from q group by a,b
union all
select * from q group by a
union all
select * from q;

and CTE is more general then Grouping Sets, so it is better do
optimalization over CTE than Grouping Sets.

If you need to buffer tuples from the outer plan and to rescan it
multiple times, tuplestore seems more appropriate solution than using
CTE node, from semantic point of view. During CTE and window functions
development, tuplestore now has that kind of capability and CTE node
is only a wrapper of tuplestore.

Moreover, I guess you don't even need to buffer tuples to aggregate by
different keys. What you have to do is only to prepare more than one
hash tables (, or set up sort order if the plan detects hash table is
too large to fit in the memory), and one time seq scan will do. The
trans values are only to be stored in the memory, not the outer plan's
results. It will win greately in performance.

it was my first solution. But I would to prepare one non hash method.
But now I thinking about some special executor node, that fill all
necessary hash parallel. It's special variant of hash agreggate.

regards
Pavel Stehule

Show quoted text

Regards,

--
Hitoshi Harada

Robert Haas

robertmhaas@gmail.com

about 17 years ago

In reply to: Pavel Stehule (#7)

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

On Tue, May 12, 2009 at 2:21 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:

Moreover, I guess you don't even need to buffer tuples to aggregate by
different keys. What you have to do is only to prepare more than one
hash tables (, or set up sort order if the plan detects hash table is
too large to fit in the memory), and one time seq scan will do. The
trans values are only to be stored in the memory, not the outer plan's
results. It will win greately in performance.

it was my first solution. But I would to prepare one non hash method.
But now I thinking about some special executor node, that fill all
necessary hash parallel. It's special variant of hash agreggate.

I think HashAggregate will often be the fastest method of executing
this kind of operation, but it would be nice to have an alternative
(such as repeatedly sorting a tuplestore) to handle non-hashable
datatypes or cases where the HashAggregate would eat too much memory.

But that leads me to a question - does the existing HashAggregate code
make any attempt to obey work_mem? I know that the infrastructure is
present for HashJoin/Hash, but on a quick pass I didn't notice
anything similar in HashAggregate.

And on a slightly off-topic note for this thread, is there any
compelling reason why we have at least three different hash
implementations in the executor? HashJoin/Hash uses one for regular
batches and one for the skew batch, and I believe that HashAggregate
does something else entirely. It seems like it might improve code
maintainability, if nothing else, to unify these to the extent
possible.

...Robert

Pavel Stehule

pavel.stehule@gmail.com

about 17 years ago

In reply to: Олег Царев (#1)

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

Hello Oleg

I am sending a new CTE based variant of my GROUPING SETS patch,

this patch has some bugs but it is good prototype (it's more stable
than old patch):

postgres=# select selling_date, baguette, sum(items) from
baguette_selling group by grouping sets(1,2);
selling_date | baguette | sum
--------------+----------+-----
2007-10-30 | | 17
2007-10-31 | | 12
| golf | 9
| buster | 20
(4 rows)

postgres=# select selling_date, baguette, sum(items),
grouping(selling_date), grouping(baguette), grouping_id(selling_date,
baguette) from baguette_selling group by grouping sets(1,2);
selling_date | baguette | sum | grouping | grouping | grouping_id
--------------+----------+-----+----------+----------+-------------
2007-10-30 | | 17 | 1 | 0 | 2
2007-10-31 | | 12 | 1 | 0 | 2
| golf | 9 | 0 | 1 | 1
| buster | 20 | 0 | 1 | 1
(4 rows)

postgres=# select selling_date, baguette, sum(items),
grouping(selling_date), grouping(baguette), grouping_id(selling_date,
baguette) from baguette_selling group by grouping sets(1,2,());
selling_date | baguette | sum | grouping | grouping | grouping_id
--------------+----------+-----+----------+----------+-------------
2007-10-30 | | 17 | 1 | 0 | 2
2007-10-31 | | 12 | 1 | 0 | 2
| golf | 9 | 0 | 1 | 1
| buster | 20 | 0 | 1 | 1
| | 29 | 0 | 0 | 0
(5 rows)

I thing so parser part is well and correct (and ported to 8.4).

CTE works well, but not 100% effective, and will be better to use
direct tuplestore interface (as second technique - when hash tables
can't to be used).

I am thinking, so the best solution is enhancing current Aggregate
node for support of GroupingSets. The code base on UNION ALL is +/-
equal to CTE, and I don't thing, so this should be optimal. But work
freely, please. I have not free time for this patch next two months.
So if you have time, it's your.

regards
Pavel Stehule

2009/5/10 Олег Царев <zabivator@gmail.com>:

Show quoted text

Hello all.
Please, approve my ideas for implementation.

Standart has feature T431: Extended grouping capabilities.
This feature i found in TODO-list:
http://wiki.postgresql.org/wiki/Todo -> SQL Commands -> TO DO

MS SQL 2005 partial support this feature:
http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
http://blogs.msdn.com/craigfr/archive/2007/09/21/aggregation-with-rollup.aspx

MS SQL 2008 support this feature:
http://blogs.msdn.com/craigfr/archive/2007/10/11/grouping-sets-in-sql-server-2008.aspx

Oracle support this feature:
http://www.compshack.com/sql/oracle-group-rollup

So, it's short notes about GROUPING SETS, but more complete
information have in a official documentation of MS SQL and Oracle
(copyright limited for send as attach).

First. GROUPG SETS.

select A,B,C,SUM(D) from table group by GROUPING SETS( (A,B,C), (A),
() ) - it's example of use grouping sets.
Semantic of this construction - make group by over source more, than
one group of column.
It's very wide key - A,B C. In result set of this example we can find
result set of select select A,B,C,SUM(D) from table group by A,B,C -
as subset. It's mind: "GROUP BY A,B,C" - subset of "GROUP BY GROUPING
SETS( (A,B,C), (A), () )
Two subset - is GROUP BY A B, and instead C column we look NULL.
Third subset - GROUP BY (), instead A,B,C - NULL, one row - it's name
"GRAND TOTAL". - calculate over all subset without grouping

Also have function "GROUPING" it's function say about null - "real
null" (from table) or generated by "GROUP BY GROUPING SETS"

My point: this feature can implement over GROUP BY and UNION ALL
We can make decomposition of "GROUP BY GROUPING SETS( (A,B,C),(A),()
)" to select A,B,C fron table GROUP BY A,B,C .UNION ALL select
A,B,NULL from table group BY A,B UNION ALL NUll,NUll,NULL from table
group by();

So, it's very simple, don't require modification of executor and
callibrate cost - only parser and semantic anylysis,
'
So, ROLLUP(A1,...An) is alias to "GROUP BY GROUPING SETS(
(A1,...,An),(A1,...,An-1),... (An-1,An),(An),() ),
CUBE - analogue.

If this idea it's good - i can write code base on old patch
http://archives.postgresql.org/pgsql-hackers/2008-10/msg00838.php or
from clean list (as you wish).

In future i know how to implement ROLLUP more optimal (executor
iterator) and use this ROLLUP for optimisation another GROUP BY,
GROUPING SETS.

Thanks.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

Олег Царев

zabivator@gmail.com

about 17 years ago

In reply to: Pavel Stehule (#9)

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

Hello all!
If no one objecte (all agree, in other say) i continue work on patch -
particulary, i want support second strategy (tuple store instead of
hash-table) for save order of source (more cheap solution in case with
grouping sets + order by), investigate and brainstorm another
optimisation, writing regression tests and technical documentation.
But I need some time for complete my investigation internals of
PostgreSQL, particulary CTE.

Thanks.

2009/5/13 Pavel Stehule <pavel.stehule@gmail.com>:

Show quoted text

Hello Oleg

I am sending a new CTE based variant of my GROUPING SETS patch,

this patch has some bugs but it is good prototype (it's more stable
than old patch):

postgres=# select selling_date, baguette, sum(items) from
baguette_selling group by grouping sets(1,2);
selling_date | baguette | sum
--------------+----------+-----
2007-10-30 | | 17
2007-10-31 | | 12
| golf | 9
| buster | 20
(4 rows)

postgres=# select selling_date, baguette, sum(items),
grouping(selling_date), grouping(baguette), grouping_id(selling_date,
baguette) from baguette_selling group by grouping sets(1,2);
selling_date | baguette | sum | grouping | grouping | grouping_id
--------------+----------+-----+----------+----------+-------------
2007-10-30 | | 17 | 1 | 0 | 2
2007-10-31 | | 12 | 1 | 0 | 2
| golf | 9 | 0 | 1 | 1
| buster | 20 | 0 | 1 | 1
(4 rows)

postgres=# select selling_date, baguette, sum(items),
grouping(selling_date), grouping(baguette), grouping_id(selling_date,
baguette) from baguette_selling group by grouping sets(1,2,());
selling_date | baguette | sum | grouping | grouping | grouping_id
--------------+----------+-----+----------+----------+-------------
2007-10-30 | | 17 | 1 | 0 | 2
2007-10-31 | | 12 | 1 | 0 | 2
| golf | 9 | 0 | 1 | 1
| buster | 20 | 0 | 1 | 1
| | 29 | 0 | 0 | 0
(5 rows)

I thing so parser part is well and correct (and ported to 8.4).

CTE works well, but not 100% effective, and will be better to use
direct tuplestore interface (as second technique - when hash tables
can't to be used).

I am thinking, so the best solution is enhancing current Aggregate
node for support of GroupingSets. The code base on UNION ALL is +/-
equal to CTE, and I don't thing, so this should be optimal. But work
freely, please. I have not free time for this patch next two months.
So if you have time, it's your.

regards
Pavel Stehule

2009/5/10 Олег Царев <zabivator@gmail.com>:

Hello all.
Please, approve my ideas for implementation.

Standart has feature T431: Extended grouping capabilities.
This feature i found in TODO-list:
http://wiki.postgresql.org/wiki/Todo -> SQL Commands -> TO DO

MS SQL 2005 partial support this feature:
http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
http://blogs.msdn.com/craigfr/archive/2007/09/21/aggregation-with-rollup.aspx

MS SQL 2008 support this feature:
http://blogs.msdn.com/craigfr/archive/2007/10/11/grouping-sets-in-sql-server-2008.aspx

Oracle support this feature:
http://www.compshack.com/sql/oracle-group-rollup

So, it's short notes about GROUPING SETS, but more complete
information have in a official documentation of MS SQL and Oracle
(copyright limited for send as attach).

First. GROUPG SETS.

select A,B,C,SUM(D) from table group by GROUPING SETS( (A,B,C), (A),
() ) - it's example of use grouping sets.
Semantic of this construction - make group by over source more, than
one group of column.
It's very wide key - A,B C. In result set of this example we can find
result set of select select A,B,C,SUM(D) from table group by A,B,C -
as subset. It's mind: "GROUP BY A,B,C" - subset of "GROUP BY GROUPING
SETS( (A,B,C), (A), () )
Two subset - is GROUP BY A B, and instead C column we look NULL.
Third subset - GROUP BY (), instead A,B,C - NULL, one row - it's name
"GRAND TOTAL". - calculate over all subset without grouping

Also have function "GROUPING" it's function say about null - "real
null" (from table) or generated by "GROUP BY GROUPING SETS"

My point: this feature can implement over GROUP BY and UNION ALL
We can make decomposition of "GROUP BY GROUPING SETS( (A,B,C),(A),()
)" to select A,B,C fron table GROUP BY A,B,C .UNION ALL select
A,B,NULL from table group BY A,B UNION ALL NUll,NUll,NULL from table
group by();

So, it's very simple, don't require modification of executor and
callibrate cost - only parser and semantic anylysis,
'
So, ROLLUP(A1,...An) is alias to "GROUP BY GROUPING SETS(
(A1,...,An),(A1,...,An-1),... (An-1,An),(An),() ),
CUBE - analogue.

If this idea it's good - i can write code base on old patch
http://archives.postgresql.org/pgsql-hackers/2008-10/msg00838.php or
from clean list (as you wish).

In future i know how to implement ROLLUP more optimal (executor
iterator) and use this ROLLUP for optimisation another GROUP BY,
GROUPING SETS.

Thanks.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

Joshua Tolley

eggyknap@gmail.com

about 17 years ago

In reply to: Pavel Stehule (#9)

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

On Tue, May 12, 2009 at 11:20:14PM +0200, Pavel Stehule wrote:

this patch has some bugs but it is good prototype (it's more stable
than old patch):

I'm not sure if you're at the point that you're interested in bug reports, but
here's something that didn't behave as expected:

5432 josh@josh*# create table gsettest (prod_id integer, cust_id integer,
quantity integer);
CREATE TABLE
5432 josh@josh*# insert into gsettest select floor(random() * 10)::int,
floor(random() * 20)::int, floor(random() * 10)::int from generate_series(1,
100);
INSERT 0 100
5432 josh@josh*# select prod_id, cust_id, sum(quantity) from gsettest group by
cube (prod_id, cust_id) order by 1, 2;
prod_id | cust_id | sum
---------+---------+-----
5 | 7 | 4
8 | 16 | 3
9 | 19 | 8
4 | 13 | 3
8 | 8 | 15
5 | 2 | 4
7 | 6 | 7
6 | 6 | 3
</snip>

Note that the results aren't sorted. The following, though, works around it:

5432 josh@josh*# select * from (select prod_id, cust_id, sum(quantity) from
gsettest group by cube (prod_id, cust_id)) f order by 1, 2;
prod_id | cust_id | sum
---------+---------+-----
0 | 2 | 8
0 | 4 | 8
0 | 5 | 2
0 | 7 | 11
0 | 8 | 7
0 | 9 | 1
0 | 12 | 3
0 | 14 | 7
0 | 16 | 5
0 | 17 | 8
0 | 18 | 9
0 | 19 | 2
0 | | 71
</snip>

EXPLAIN output is as follows:
5432 josh@josh*# explain select prod_id, cust_id, sum(quantity) from gsettest
group by cube (prod_id, cust_id) order by 1, 2;
QUERY PLAN
---------------------------------------------------------------------------
Append (cost=193.54..347.71 rows=601 width=9)
CTE **g**
-> Sort (cost=135.34..140.19 rows=1940 width=12)
Sort Key: gsettest.prod_id, gsettest.cust_id
-> Seq Scan on gsettest (cost=0.00..29.40 rows=1940 width=12)
-> HashAggregate (cost=53.35..55.85 rows=200 width=12)
-> CTE Scan on "**g**" (cost=0.00..38.80 rows=1940 width=12)
-> HashAggregate (cost=48.50..51.00 rows=200 width=8)
-> CTE Scan on "**g**" (cost=0.00..38.80 rows=1940 width=8)
-> HashAggregate (cost=48.50..51.00 rows=200 width=8)
-> CTE Scan on "**g**" (cost=0.00..38.80 rows=1940 width=8)
-> Aggregate (cost=43.65..43.66 rows=1 width=4)
-> CTE Scan on "**g**" (cost=0.00..38.80 rows=1940 width=4)
(13 rows)

...and without the ORDER BY clause just to prove that it really is the reason
for the Sort step...

5432 josh@josh*# explain select prod_id, cust_id, sum(quantity) from gsettest
group by cube (prod_id, cust_id);
QUERY PLAN
------------------------------------------------------------------------
Append (cost=82.75..236.92 rows=601 width=9)
CTE **g**
-> Seq Scan on gsettest (cost=0.00..29.40 rows=1940 width=12)
-> HashAggregate (cost=53.35..55.85 rows=200 width=12)
-> CTE Scan on "**g**" (cost=0.00..38.80 rows=1940 width=12)
-> HashAggregate (cost=48.50..51.00 rows=200 width=8)
-> CTE Scan on "**g**" (cost=0.00..38.80 rows=1940 width=8)
-> HashAggregate (cost=48.50..51.00 rows=200 width=8)
-> CTE Scan on "**g**" (cost=0.00..38.80 rows=1940 width=8)
-> Aggregate (cost=43.65..43.66 rows=1 width=4)
-> CTE Scan on "**g**" (cost=0.00..38.80 rows=1940 width=4)
(11 rows)

I'm hoping I'll get a chance to poke at the patch some. This could be very
useful...

- Josh / eggyknap

#12

Pavel Stehule

pavel.stehule@gmail.com

about 17 years ago

In reply to: Joshua Tolley (#11)

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

2009/5/13 Joshua Tolley <eggyknap@gmail.com>:

On Tue, May 12, 2009 at 11:20:14PM +0200, Pavel Stehule wrote:

this patch has some bugs but it is good prototype (it's more stable
than old patch):

I'm not sure if you're at the point that you're interested in bug reports, but
here's something that didn't behave as expected:

5432 josh@josh*# create table gsettest (prod_id integer, cust_id integer,
quantity integer);
CREATE TABLE
5432 josh@josh*# insert into gsettest select floor(random() * 10)::int,
floor(random() * 20)::int, floor(random() * 10)::int from generate_series(1,
100);
INSERT 0 100
5432 josh@josh*# select prod_id, cust_id, sum(quantity) from gsettest group by
cube (prod_id, cust_id) order by 1, 2;
prod_id | cust_id | sum
---------+---------+-----
5 | 7 | 4
8 | 16 | 3
9 | 19 | 8
4 | 13 | 3
8 | 8 | 15
5 | 2 | 4
7 | 6 | 7
6 | 6 | 3
</snip>

Note that the results aren't sorted. The following, though, works around it:

I thing, so result should not be sorted - it's same like normal group by.

regards
Pavel Stehule

Show quoted text

5432 josh@josh*# select * from (select prod_id, cust_id, sum(quantity) from
gsettest group by cube (prod_id, cust_id)) f order by 1, 2;
prod_id | cust_id | sum
---------+---------+-----
0 | 2 | 8
0 | 4 | 8
0 | 5 | 2
0 | 7 | 11
0 | 8 | 7
0 | 9 | 1
0 | 12 | 3
0 | 14 | 7
0 | 16 | 5
0 | 17 | 8
0 | 18 | 9
0 | 19 | 2
0 | | 71
</snip>

EXPLAIN output is as follows:
5432 josh@josh*# explain select prod_id, cust_id, sum(quantity) from gsettest
group by cube (prod_id, cust_id) order by 1, 2;
QUERY PLAN
---------------------------------------------------------------------------
Append (cost=193.54..347.71 rows=601 width=9)
CTE **g**
-> Sort (cost=135.34..140.19 rows=1940 width=12)
Sort Key: gsettest.prod_id, gsettest.cust_id
-> Seq Scan on gsettest (cost=0.00..29.40 rows=1940 width=12)
-> HashAggregate (cost=53.35..55.85 rows=200 width=12)
-> CTE Scan on "**g**" (cost=0.00..38.80 rows=1940 width=12)
-> HashAggregate (cost=48.50..51.00 rows=200 width=8)
-> CTE Scan on "**g**" (cost=0.00..38.80 rows=1940 width=8)
-> HashAggregate (cost=48.50..51.00 rows=200 width=8)
-> CTE Scan on "**g**" (cost=0.00..38.80 rows=1940 width=8)
-> Aggregate (cost=43.65..43.66 rows=1 width=4)
-> CTE Scan on "**g**" (cost=0.00..38.80 rows=1940 width=4)
(13 rows)

...and without the ORDER BY clause just to prove that it really is the reason
for the Sort step...

5432 josh@josh*# explain select prod_id, cust_id, sum(quantity) from gsettest
group by cube (prod_id, cust_id);
QUERY PLAN
------------------------------------------------------------------------
Append (cost=82.75..236.92 rows=601 width=9)
CTE **g**
-> Seq Scan on gsettest (cost=0.00..29.40 rows=1940 width=12)
-> HashAggregate (cost=53.35..55.85 rows=200 width=12)
-> CTE Scan on "**g**" (cost=0.00..38.80 rows=1940 width=12)
-> HashAggregate (cost=48.50..51.00 rows=200 width=8)
-> CTE Scan on "**g**" (cost=0.00..38.80 rows=1940 width=8)
-> HashAggregate (cost=48.50..51.00 rows=200 width=8)
-> CTE Scan on "**g**" (cost=0.00..38.80 rows=1940 width=8)
-> Aggregate (cost=43.65..43.66 rows=1 width=4)
-> CTE Scan on "**g**" (cost=0.00..38.80 rows=1940 width=4)
(11 rows)

I'm hoping I'll get a chance to poke at the patch some. This could be very
useful...

- Josh / eggyknap

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAkoKOdUACgkQRiRfCGf1UMOpFQCeJGQftMheSi6blMwheK4HI89p
E7cAnjdWi4FaerR/+RTBeSv9Zc0RRXQ3
=xW04
-----END PGP SIGNATURE-----

#13

Joshua Tolley

eggyknap@gmail.com

about 17 years ago

In reply to: Pavel Stehule (#12)

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

On Wed, May 13, 2009 at 06:29:41AM +0200, Pavel Stehule wrote:

2009/5/13 Joshua Tolley <eggyknap@gmail.com>:

On Tue, May 12, 2009 at 11:20:14PM +0200, Pavel Stehule wrote:

this patch has some bugs but it is good prototype (it's more stable
than old patch):

I'm not sure if you're at the point that you're interested in bug reports, but
here's something that didn't behave as expected:

5432 josh@josh*# create table gsettest (prod_id integer, cust_id integer,
quantity integer);
CREATE TABLE
5432 josh@josh*# insert into gsettest select floor(random() * 10)::int,
floor(random() * 20)::int, floor(random() * 10)::int from generate_series(1,
100);
INSERT 0 100
5432 josh@josh*# select prod_id, cust_id, sum(quantity) from gsettest group by
cube (prod_id, cust_id) order by 1, 2;
prod_id | cust_id | sum
---------+---------+-----
5 | 7 | 4
8 | 16 | 3
9 | 19 | 8
4 | 13 | 3
8 | 8 | 15
5 | 2 | 4
7 | 6 | 7
6 | 6 | 3
</snip>

Note that the results aren't sorted. The following, though, works around it:

I thing, so result should not be sorted - it's same like normal group by.

Normal GROUP BY wouldn't have ignored the ORDER BY clause I included.

- Josh

#14

Pavel Stehule

pavel.stehule@gmail.com

about 17 years ago

In reply to: Joshua Tolley (#13)

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

2009/5/13 Joshua Tolley <eggyknap@gmail.com>:

On Wed, May 13, 2009 at 06:29:41AM +0200, Pavel Stehule wrote:

2009/5/13 Joshua Tolley <eggyknap@gmail.com>:

On Tue, May 12, 2009 at 11:20:14PM +0200, Pavel Stehule wrote:

this patch has some bugs but it is good prototype (it's more stable
than old patch):

I'm not sure if you're at the point that you're interested in bug reports, but
here's something that didn't behave as expected:

5432 josh@josh*# create table gsettest (prod_id integer, cust_id integer,
quantity integer);
CREATE TABLE
5432 josh@josh*# insert into gsettest select floor(random() * 10)::int,
floor(random() * 20)::int, floor(random() * 10)::int from generate_series(1,
100);
INSERT 0 100
5432 josh@josh*# select prod_id, cust_id, sum(quantity) from gsettest group by
cube (prod_id, cust_id) order by 1, 2;
prod_id | cust_id | sum
---------+---------+-----
5 | 7 | 4
8 | 16 | 3
9 | 19 | 8
4 | 13 | 3
8 | 8 | 15
5 | 2 | 4
7 | 6 | 7
6 | 6 | 3
</snip>

Note that the results aren't sorted. The following, though, works around it:

I thing, so result should not be sorted - it's same like normal group by.

Normal GROUP BY wouldn't have ignored the ORDER BY clause I included.

sorry, now I understand - simply it is a bug. I fixed it

Thank You
Pavel

Show quoted text

- Josh

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAkoKxLQACgkQRiRfCGf1UMOj/wCgkPnRiheRr+BNPLBCjzA9XlFW
0rsAoI0eOGSGlxIv0eNB8zqum7kw/Cqw
=FCTz
-----END PGP SIGNATURE-----

#15

Tom Lane

tgl@sss.pgh.pa.us

about 17 years ago

In reply to: Robert Haas (#8)

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

Robert Haas <robertmhaas@gmail.com> writes:

But that leads me to a question - does the existing HashAggregate code
make any attempt to obey work_mem?

No.

regards, tom lane

#16

David Fetter

david@fetter.org

about 17 years ago

In reply to: Pavel Stehule (#14)

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

On Wed, May 13, 2009 at 03:12:51PM +0200, Pavel Stehule wrote:

2009/5/13 Joshua Tolley <eggyknap@gmail.com>:

On Wed, May 13, 2009 at 06:29:41AM +0200, Pavel Stehule wrote:

2009/5/13 Joshua Tolley <eggyknap@gmail.com>:

On Tue, May 12, 2009 at 11:20:14PM +0200, Pavel Stehule wrote:

this patch has some bugs but it is good prototype (it's more stable
than old patch):

I'm not sure if you're at the point that you're interested in bug reports, but
here's something that didn't behave as expected:

5432 josh@josh*# create table gsettest (prod_id integer, cust_id integer,
quantity integer);
CREATE TABLE
5432 josh@josh*# insert into gsettest select floor(random() * 10)::int,
floor(random() * 20)::int, floor(random() * 10)::int from generate_series(1,
100);
INSERT 0 100
5432 josh@josh*# select prod_id, cust_id, sum(quantity) from gsettest group by
cube (prod_id, cust_id) order by 1, 2;
ï¿½prod_id | cust_id | sum
---------+---------+-----
ï¿½ ï¿½ ï¿½ 5 | ï¿½ ï¿½ ï¿½ 7 | ï¿½ 4
ï¿½ ï¿½ ï¿½ 8 | ï¿½ ï¿½ ï¿½16 | ï¿½ 3
ï¿½ ï¿½ ï¿½ 9 | ï¿½ ï¿½ ï¿½19 | ï¿½ 8
ï¿½ ï¿½ ï¿½ 4 | ï¿½ ï¿½ ï¿½13 | ï¿½ 3
ï¿½ ï¿½ ï¿½ 8 | ï¿½ ï¿½ ï¿½ 8 | ï¿½15
ï¿½ ï¿½ ï¿½ 5 | ï¿½ ï¿½ ï¿½ 2 | ï¿½ 4
ï¿½ ï¿½ ï¿½ 7 | ï¿½ ï¿½ ï¿½ 6 | ï¿½ 7
ï¿½ ï¿½ ï¿½ 6 | ï¿½ ï¿½ ï¿½ 6 | ï¿½ 3
</snip>

Note that the results aren't sorted. The following, though, works around it:

I thing, so result should not be sorted - it's same like normal group by.

Normal GROUP BY wouldn't have ignored the ORDER BY clause I included.

sorry, now I understand - simply it is a bug. I fixed it

Where's the new patch?

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#17

Pavel Stehule

pavel.stehule@gmail.com

about 17 years ago

In reply to: David Fetter (#16)

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

Here is.

I checked result with Oracle and basic results are same with one exception

this patch doesn't well do with expr specified sets

this result is correct

postgres=# select selling_date, baguette, canteen, sum(items),
grouping(baguette), grouping(selling_date),
grouping_id(baguette,selling_date) from baguette_selling group by
grouping sets(baguette, selling_date, canteen,());
selling_date | baguette | canteen | sum | grouping | grouping | grouping_id
--------------+----------+---------+-----+----------+----------+-------------
| golf | | 9 | 0 | 1 | 1
| buster | | 20 | 0 | 1 | 1
2007-10-30 | | | 17 | 1 | 0 | 2
2007-10-31 | | | 12 | 1 | 0 | 2
| | Prague | 14 | 1 | 1 | 3
| | Berlin | 15 | 1 | 1 | 3
| | | 29 | 1 | 1 | 3
(7 rows)

but this result not:

postgres=# select extract(day from selling_date), selling_date,
baguette, canteen, sum(items), grouping(baguette),
grouping(selling_date), grouping_id(baguette,selling_date) from
baguette_selling group by grouping sets(baguette, selling_date,
canteen, extract(day from selling_date))
;
date_part | selling_date | baguette | canteen | sum | grouping |
grouping | grouping_id
-----------+--------------+----------+---------+-----+----------+----------+-------------
| | golf | | 9 | 0 |
1 | 1
| | buster | | 20 | 0 |
1 | 1
30 | 2007-10-30 | | | 17 | 1 |
0 | 2
31 | 2007-10-31 | | | 12 | 1 |
0 | 2
| | | Prague | 14 | 1 |
1 | 3
| | | Berlin | 15 | 1 |
1 | 3
| | | | 29 | 1 |
1 | 3
(7 rows)

date_part column is problematic.

regards
Pavel Stehule

2009/5/13 David Fetter <david@fetter.org>:

Show quoted text

On Wed, May 13, 2009 at 03:12:51PM +0200, Pavel Stehule wrote:

2009/5/13 Joshua Tolley <eggyknap@gmail.com>:

On Wed, May 13, 2009 at 06:29:41AM +0200, Pavel Stehule wrote:

2009/5/13 Joshua Tolley <eggyknap@gmail.com>:

On Tue, May 12, 2009 at 11:20:14PM +0200, Pavel Stehule wrote:

this patch has some bugs but it is good prototype (it's more stable
than old patch):

I'm not sure if you're at the point that you're interested in bug reports, but
here's something that didn't behave as expected:

5432 josh@josh*# create table gsettest (prod_id integer, cust_id integer,
quantity integer);
CREATE TABLE
5432 josh@josh*# insert into gsettest select floor(random() * 10)::int,
floor(random() * 20)::int, floor(random() * 10)::int from generate_series(1,
100);
INSERT 0 100
5432 josh@josh*# select prod_id, cust_id, sum(quantity) from gsettest group by
cube (prod_id, cust_id) order by 1, 2;
prod_id | cust_id | sum
---------+---------+-----
5 | 7 | 4
8 | 16 | 3
9 | 19 | 8
4 | 13 | 3
8 | 8 | 15
5 | 2 | 4
7 | 6 | 7
6 | 6 | 3
</snip>

Note that the results aren't sorted. The following, though, works around it:

I thing, so result should not be sorted - it's same like normal group by.

Normal GROUP BY wouldn't have ignored the ORDER BY clause I included.

sorry, now I understand - simply it is a bug. I fixed it

Where's the new patch?

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#18

Pavel Stehule

pavel.stehule@gmail.com

about 17 years ago

In reply to: Олег Царев (#5)

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

Hello,

here is last my grouping sets patch - next work will be done by Oleg.

This patch has full functionality of grouping sets - with support for
grouping via expr.

postgres=# select extract(year from a),a,b, sum(c) from foo group by
grouping sets(extract(year from a), a, b,()); date_part | a |
b | sum
-----------+------------+----+-----
2001 | | | 40
| 2001-10-12 | | 20
| 2001-10-11 | | 20
| | 10 | 40
| | | 40
(5 rows)

postgres=# select * from foo;
a | b | c
------------+----+----
2001-10-11 | 10 | 20
2001-10-12 | 10 | 20
(2 rows)

This patch is WIP - has full functionality, but is based on CTE - what
is mas/menos ugly hack. But for testing and first view about grouping
sets, it should do good work.

regards
Pavel Stehule

2009/5/10 Олег Царев <zabivator@gmail.com>:

Show quoted text

I will write separated mail about rollup.
Can you send me some links or information about "CTE"? What is it?
Also, I need some deep knownledge about postgresql aggregation
calculation (executor part) - can you help me (links, books, name of
source files, etc)?.
After get additional information i will can continue discussion.
Thanls
2009/5/10 Pavel Stehule <pavel.stehule@gmail.com>:

2009/5/10 Олег Царев <zabivator@gmail.com>:

Hello, Pavel.

I read you letter, and found some points:

a) Simple transformation any GROUPING SETS in GROUP BY + UNION ALL
require clone source for every group.
It's so slow.
My point: we can make for start simple implementation.
b) Your sollution it's so good, IMHO
WITH q AS (SELECT * FROM some)
SELECT * FROM q GROUP BY a
UNION ALL
SELECT * FROM q GROUP BY b
But can you describe me how it's used on planner (calculate cost) and exector.

look on CTE source code or ask to CTE author.

Sharing source - it's mind - we don't process tree, we process
DIRECTED GRAPH. Hard, many bugs, so on, not?

for example - code for DISTINCT is shared with code for aggregation.
My idea is based on similarity

GROUPING SETS is subset of CTE, nonrecursive CTE is subset of UNION ALL

PostgreSQL support sharing nodes of executor? How? Where i can read?

I don't know, what do you thing exactly sharing nodes? Before CTE I
had to write special holder node, but it isn't necessary now.

Next. Ok, we spool source. After we use simple hash group/sort group +
union all? Or more advanced techniques? If different group by - it's
not different from my idea, it's logic additional (spool source for
fast rewind)

I thing so trivial implementation via UNION ALL should work, but any
optimisations means lot of work - and when you add rewind
functionality, you will got trivial nonrecursive CTE implementation.
When I wrote patch, there wasn't possible an use CTE, so I played wit
own nodes. Now, the more clean solution is probably based on current
CTE base.

c) Don't forget also about node requiments - if we use hash table as
container for source - we reduce some sorting properties - and query
select A,B from TABLE group by ((A,B),(B)) order by a,b require
sorting on output of grouping sets.
It's mind - we need insert sort.

Yes, probably there should be possible some techniques - that should
to eliminate final sort op. I am not sure if it is really important.
Now I thinking what is the bigger devil a) patch complexity b) some
suboptimality (mainly redundant call of agg nodes). For me a.
Aggregates are not gratis, but significantly more expensive is
repeated seq data scan. So it is first goal. Later we should to thing
about next optimalizations.

d) We can make SORT ROLLUP and SORT REDUCE ROLLUP.
first - sort group by use sorting for grouping data and calculate aggregations.
So, 'order by A,B,C' also 'order by A,B' also order by 'A' also order by ''
What it's mind? It's mind we can use sort features for implement SORT
ROLLUP. For every sub group we calculate self aggregates... oh, i bad
say. Look on "select a,sum(b) from table group by a". Sort group
grouping by a, and for every a-group calculate aggregations (sum), for
ROLLUP A,B we can make three aggregations ; for A,B than A than () -
and use one path on source calculate aggregations on every step, and
for split source on different subsets.
It's more perfomance, than different group by and union all.

Maybe, I don't know. What I know:

* UNION ALL will have better result on indexed sets and MIN, MAX
aggregates. CTE isn't optimized now for these cases.
* UNION ALL will be slow for large sets (over cache),
* CTE needs some more optimalizations,
* pg core hackers dislike big and complex patches,
* pg core hackers like any improved current code base.

I am thinking so Grouping Sets based on CTE should be more commitable
code. It doesn't mean so your ideas are wrong, but these
optimalization should to work on CTE too.

select * from table group by rollup(a,b,c)

have to have generate same plan as

with q as (select * from table)
select * from q group by a,b,c
union all
select * from q group by a,b
union all
select * from q group by a
union all
select * from q;

and CTE is more general then Grouping Sets, so it is better do
optimalization over CTE than Grouping Sets.

Regards
Pavel Stehule

Look for MS SQL:
http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
Why MS SQL don't support distinct aggregations? I think - because
every distinct aggregations in MS SQL require hash, and many
aggregations - it's so slow.

Thank you!
2009/5/10 Pavel Stehule <pavel.stehule@gmail.com>:

Hello

some other info is on http://wiki.postgresql.org/wiki/Grouping_Sets

Maybe in e few weak I'll have some a prototype based on CTE. My older
technique based on hashed tables should be well, but it carry lot of
unshared code. Using CTE means so we can optimize CTE and GROUPING
SETS together. I plan to transform query

SELECT * FROM some GROUP BY GROUPING SETS(a, b)

to

WITH q AS (SELECT * FROM some)
SELECT * FROM q GROUP BY a
UNION ALL
SELECT * FROM q GROUP BY b

2009/5/10 Олег Царев <zabivator@gmail.com>:

Hello all.
Please, approve my ideas for implementation.

Standart has feature T431: Extended grouping capabilities.
This feature i found in TODO-list:
http://wiki.postgresql.org/wiki/Todo -> SQL Commands -> TO DO

MS SQL 2005 partial support this feature:
http://www.kodyaz.com/articles/sql-server-2005-cube-rollup-cannot-compute-distinct-aggregates.aspx
http://blogs.msdn.com/craigfr/archive/2007/09/21/aggregation-with-rollup.aspx

MS SQL 2008 support this feature:
http://blogs.msdn.com/craigfr/archive/2007/10/11/grouping-sets-in-sql-server-2008.aspx

Oracle support this feature:
http://www.compshack.com/sql/oracle-group-rollup

So, it's short notes about GROUPING SETS, but more complete
information have in a official documentation of MS SQL and Oracle
(copyright limited for send as attach).

First. GROUPG SETS.

select A,B,C,SUM(D) from table group by GROUPING SETS( (A,B,C), (A),
() ) - it's example of use grouping sets.
Semantic of this construction - make group by over source more, than
one group of column.
It's very wide key - A,B C. In result set of this example we can find
result set of select select A,B,C,SUM(D) from table group by A,B,C -
as subset. It's mind: "GROUP BY A,B,C" - subset of "GROUP BY GROUPING
SETS( (A,B,C), (A), () )
Two subset - is GROUP BY A B, and instead C column we look NULL.
Third subset - GROUP BY (), instead A,B,C - NULL, one row - it's name
"GRAND TOTAL". - calculate over all subset without grouping

Also have function "GROUPING" it's function say about null - "real
null" (from table) or generated by "GROUP BY GROUPING SETS"

My point: this feature can implement over GROUP BY and UNION ALL
We can make decomposition of "GROUP BY GROUPING SETS( (A,B,C),(A),()
)" to select A,B,C fron table GROUP BY A,B,C .UNION ALL select
A,B,NULL from table group BY A,B UNION ALL NUll,NUll,NULL from table
group by();

So, it's very simple, don't require modification of executor and
callibrate cost - only parser and semantic anylysis,
'
So, ROLLUP(A1,...An) is alias to "GROUP BY GROUPING SETS(
(A1,...,An),(A1,...,An-1),... (An-1,An),(An),() ),
CUBE - analogue.

If this idea it's good - i can write code base on old patch
http://archives.postgresql.org/pgsql-hackers/2008-10/msg00838.php or
from clean list (as you wish).

This is suboptimal. When SELECT * FROM X GROUP BY GROUPING SETS ...,
where X is some joined data, then you repeat JOIN on every grouping
set. So your solution is simple for implementation, but it should be
really slow.

Regards
Pavel Stehule

In future i know how to implement ROLLUP more optimal (executor
iterator) and use this ROLLUP for optimisation another GROUP BY,
GROUPING SETS.

Thanks.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19

Alvaro Herrera

alvherre@2ndquadrant.com

almost 17 years ago

In reply to: Олег Царев (#10)

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

Олег Царев escribió:

Hello all!
If no one objecte (all agree, in other say) i continue work on patch -
particulary, i want support second strategy (tuple store instead of
hash-table) for save order of source (more cheap solution in case with
grouping sets + order by), investigate and brainstorm another
optimisation, writing regression tests and technical documentation.
But I need some time for complete my investigation internals of
PostgreSQL, particulary CTE.

Where are we on this patch? Is it moving forward?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#20

Hitoshi Harada

umi.tanuki@gmail.com

almost 17 years ago

In reply to: Alvaro Herrera (#19)

Re: Implementation of GROUPING SETS (T431: Extended grouping capabilities)

2009/8/8 Alvaro Herrera <alvherre@commandprompt.com>:

Олег Царев escribió:

Hello all!
If no one objecte (all agree, in other say) i continue work on patch -
particulary, i want support second strategy (tuple store instead of
hash-table) for save order of source (more cheap solution in case with
grouping sets + order by), investigate and brainstorm another
optimisation, writing regression tests and technical documentation.
But I need some time for complete my investigation internals of
PostgreSQL, particulary CTE.

Where are we on this patch? Is it moving forward?

It seems to me that the patch goes backward.

I looked trough the gsets-0.6.diff for about an hour, and found it is
now only a syntax sugar that builds multiple GROUP BY queries based on
CTE functionality. There's no executor modification.

If I remember correctly, the original patch touched executor parts.
I'd buy if the GROUPING SETS touches executor but I don't if this is
only syntax sugar, because you can write it as the same by yourself
without GROUPING SETS syntax. The motivation we push this forward is
performance that cannot be made by rewriting query, I guess.

Because GROUP BY we have today is a subset of GROUPING SETS by
definition, I suppose we'll refactor nodeAgg.c so that it is allowed
to take multiple group definitions. And we must support both of
HashAgg and GroupAgg. For HashAgg, it is easier in any case as the
earlier patch does. For GroupAgg, it is a bit complicated since we
sort by different key sets.

When we want GROUPING SET(a, b), at first we sort by a and aggregate
then sort by b and aggregate. This is the same as:

select a, null, count(*) from x group by a
union all
select null, b, count(*) from x group by b

so nothing better than query rewriting unless we invent something new.

But in case of sub total and grand total like ROLLUP query, GroupAgg
can do it by one-time scan by having multiple life cycle PerGroup
state.

Anyway, before going ahead we need to find rough sketch of how to
implement this feature. Only syntax sugar is acceptable? Or internal
executor support is necessary?

Regards,

--
Hitoshi Harada

#21

Олег Царев

zabivator@gmail.com

almost 17 years ago

In reply to: Hitoshi Harada (#20)

#22

Pavel Stehule

pavel.stehule@gmail.com

almost 17 years ago

In reply to: Hitoshi Harada (#20)