No: implied sort with group by

Started by Zeugswetter Andreas DBTalmost 28 years ago16 messages
#1Zeugswetter Andreas DBT
Andreas.Zeugswetter@telecom.at

Does the SQL standard say anything about an implied sort when
grouping or is it up to the user to include an ORDER BY clause?
darrenk

Without order by the result set is never guaranteed to be ordered in a
specific way (standard speak). The order is dependent on the chosen
query path, which
changes from query to query.

Andreas

#2Noname
darrenk@insightdist.com
In reply to: Zeugswetter Andreas DBT (#1)
Re: [HACKERS] No: implied sort with group by

Does the SQL standard say anything about an implied sort when
grouping or is it up to the user to include an ORDER BY clause?
darrenk

Without order by the result set is never guaranteed to be ordered in a
specific way (standard speak). The order is dependent on the chosen
query path, which changes from query to query.

Andreas

Postgres should then do an internal sort before grouping. In the second
of your examples, I take the above to mean that either row could be
returned first.

In order to get that result set though, the data needs to be sorted before
getting to the group by node in the executor. The order of that internal
sort is purely arbitrary, it just has to be done.

This is what I think is missing or broken right now.

select * from t1;

a b c
1 x
2 x
3 z
2 x

4 row(s) retrieved.

select b,c,sum(a) from t1 group by b,c;

b c (sum)

x 5
z 3

2 row(s) retrieved.

darrenk

#3Thomas G. Lockhart
lockhart@alumni.caltech.edu
In reply to: Noname (#2)
Re: [HACKERS] No: implied sort with group by

Does the SQL standard say anything about an implied sort when
grouping or is it up to the user to include an ORDER BY clause?

Up to the user. SQL is a set-oriented language. The fact that many/most/all
implementations order results to then do grouping is an implementation
detail, not a language definition.

This is what I think is missing or broken right now.

select * from t1;

a b c
1 x
2 x
3 z
2 x

4 row(s) retrieved.

select b,c,sum(a) from t1 group by b,c;

b c (sum)

x 5
z 3

2 row(s) retrieved.

Sorry, I've lost the thread. What is broken? I get this same result, and
(assuming that column "b" is full of nulls) I think this the correct result.

- Tom

#4Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Thomas G. Lockhart (#3)
Re: [HACKERS] No: implied sort with group by

Does the SQL standard say anything about an implied sort when
grouping or is it up to the user to include an ORDER BY clause?

Up to the user. SQL is a set-oriented language. The fact that many/most/all
implementations order results to then do grouping is an implementation
detail, not a language definition.

This is what I think is missing or broken right now.

select * from t1;

a b c
1 x
2 x
3 z
2 x

4 row(s) retrieved.

select b,c,sum(a) from t1 group by b,c;

b c (sum)

x 5
z 3

2 row(s) retrieved.

Sorry, I've lost the thread. What is broken? I get this same result, and
(assuming that column "b" is full of nulls) I think this the correct result.

At one point, it was thought that NULLs shouldn't be grouped, but I
backed out the patch. There is a problem with GROUP BY on large
datasets, and Vadim knows the cause, and will work on it later.

--
Bruce Momjian
maillist@candle.pha.pa.us

#5Noname
darrenk@insightdist.com
In reply to: Bruce Momjian (#4)
Re: [HACKERS] No: implied sort with group by

This is what I think is missing or broken right now.

select * from t1;

a b c
1 x
2 x
3 z
2 x

4 row(s) retrieved.

select b,c,sum(a) from t1 group by b,c;

b c (sum)

x 5
z 3

2 row(s) retrieved.

Sorry, I've lost the thread. What is broken? I get this same result, and
(assuming that column "b" is full of nulls) I think this the correct result.

At one point, it was thought that NULLs shouldn't be grouped, but I
backed out the patch. There is a problem with GROUP BY on large
datasets, and Vadim knows the cause, and will work on it later.

Different from the grouping by NULLs issue...

The above results are from Sybase. If these same four rows are inserted into
postgres, the second query will return three rows. Something like...

b|c|sum(a)
|x|3
|z|3
|x|2

It does this not because of the null values of column b, but because the data is
not sorted before getting to the group by node if the user does not explicitly put
an order by in the query. IMHO, postgres should put an arbitrary sort node in the
tree so that the data can be properly grouped as the group node iterates over it.

And even if I put an "order by c" clause in there, I still get three rows, they're
just properly sorted. :)

darrenk

#6Thomas G. Lockhart
lockhart@alumni.caltech.edu
In reply to: Noname (#5)
Re: [HACKERS] No: implied sort with group by

This is what I think is missing or broken right now.

select * from t1;

a b c
1 x
2 x
3 z
2 x

4 row(s) retrieved.

select b,c,sum(a) from t1 group by b,c;

b c (sum)

x 5
z 3

2 row(s) retrieved.

Sorry, I've lost the thread. What is broken? I get this same result, and
(assuming that column "b" is full of nulls) I think this the correct result.

At one point, it was thought that NULLs shouldn't be grouped, but I
backed out the patch. There is a problem with GROUP BY on large
datasets, and Vadim knows the cause, and will work on it later.

Different from the grouping by NULLs issue...

The above results are from Sybase. If these same four rows are inserted into
postgres, the second query will return three rows. Something like...

b|c|sum(a)
|x|3
|z|3
|x|2

It does this not because of the null values of column b, but because the data is
not sorted before getting to the group by node if the user does not explicitly put
an order by in the query. IMHO, postgres should put an arbitrary sort node in the
tree so that the data can be properly grouped as the group node iterates over it.

And even if I put an "order by c" clause in there, I still get three rows, they're
just properly sorted. :)

Not necessarily true; as I said, I get the same result as above (with the 980112
source tree; have things changed since??). Perhaps you are running into the sorting
problem which seemed to be present on larger tables only?

- Tom

postgres=> select b,c,sum(a) from t1 group by b,c;
b|c|sum
-+-+---
|x| 5
|z| 3
(2 rows)

postgres=> select * from t1;
a|b|c
-+-+-
1| |x
2| |x
2| |x
3| |z
(4 rows)

#7Zeugswetter Andreas DBT
Andreas.Zeugswetter@telecom.at
In reply to: Thomas G. Lockhart (#6)
Re: [HACKERS] No: implied sort with group by

darrenk wrote:

postgres should then do an internal sort before grouping. In the

second

of your examples, I take the above to mean that either row could be
returned first.

yes (standard speak)

In order to get that result set though, the data needs to be sorted

before

getting to the group by node in the executor. The order of that

internal

sort is purely arbitrary, it just has to be done.

either that or group the result set into an implicit temp table
internally.
If a compound index exists on b,c then an index path could be used
instead.
(compound btree would also be good for order by, of course yall know ;-)
An auto index path (temp index is created on the fly and dropped after
query completion)
might also be considered.

Andreas

#8Noname
darrenk@insightdist.com
In reply to: Zeugswetter Andreas DBT (#7)
Re: [HACKERS] No: implied sort with group by

Not necessarily true; as I said, I get the same result as above (with the 980112
source tree; have things changed since??). Perhaps you are running into the sorting
problem which seemed to be present on larger tables only?

- Tom

postgres=> select b,c,sum(a) from t1 group by b,c;
b|c|sum
-+-+---
|x| 5
|z| 3
(2 rows)

postgres=> select * from t1;
a|b|c
-+-+-
1| |x
2| |x
2| |x
3| |z
(4 rows)

Hmmm...I have a snapshot from about ten days ago, I'll get something newer and
try this again. I've been putting off getting a new one until I get the block
size patch done. Annoying to put the changes back into a new src copy (but not
as annoying as dealing with #(*&^! insurance companies claims departments).

Is the order from the second query the order that the rows were inserted?

Do you get the same results if you insert the (3,null,'z') second or third so
the rows are stored out of order? I was getting my bad results with this same
data, only four rows. I do have a problem with large groupings on two or more
columns running out of memory, but not the problem that linux users are seeing.

darrenk

#9Thomas G. Lockhart
lockhart@alumni.caltech.edu
In reply to: Noname (#8)
Re: [HACKERS] No: implied sort with group by

Not necessarily true; as I said, I get the same result as above (with the 980112
source tree; have things changed since??). Perhaps you are running into the sorting
problem which seemed to be present on larger tables only?

postgres=> select b,c,sum(a) from t1 group by b,c;
b|c|sum
-+-+---
|x| 5
|z| 3
(2 rows)

postgres=> select * from t1;
a|b|c
-+-+-
1| |x
2| |x
2| |x
3| |z
(4 rows)

Hmmm...I have a snapshot from about ten days ago

Is the order from the second query the order that the rows were inserted?

Do you get the same results if you insert the (3,null,'z') second or third so
the rows are stored out of order? I was getting my bad results with this same
data, only four rows.

OUCH! You are right, there is a problem with this simple test case:

postgres=> select b,c,sum(a) from t1 group by b,c;
b|c|sum
-+-+---
|x| 5
|z| 3
|x| 0
(3 rows)

postgres=> select * from t1;
a|b|c
-+-+-
1| |x
2| |x
2| |x
3| |z
0| |x
(5 rows)

I just inserted a single out-of-order row at the end of the table which, since the
integer value is zero, should have not affected the result. Sorry I didn't understand
the nature of the test case.

- Tom

#10Noname
darrenk@insightdist.com
In reply to: Thomas G. Lockhart (#9)
Re: [HACKERS] No: implied sort with group by

postgres=> select b,c,sum(a) from t1 group by b,c;
b|c|sum
-+-+---
|x| 5
|z| 3
|x| 0
(3 rows)

postgres=> select * from t1;
a|b|c
-+-+-
1| |x
2| |x
2| |x
3| |z
0| |x
(5 rows)

I just inserted a single out-of-order row at the end of the table which, since the
integer value is zero, should have not affected the result. Sorry I didn't understand
the nature of the test case.

The order of the implicit sort would be arbitrary, but should first sort on
any fields in a given ORDER BY to help speed things up later in the tree.

What are the effects of sorted or partially sorted input data to the sort code?

The current group/aggregate code seems to just loop over the tuples as they are.

I see two ways to fix the above, one w/minimal code, second w/more work, but
potentially better speed for large queries.

1. Put a sort node immediately before the group node, taking into account
any user given ordering. Also make sure the optimizer is aware of this sort
when calculating query costs.

2. Instead of sorting the tuples before grouping, add a hashing system to
the group node so that the pre-sorting is not necessary.

Hmmm...is this a grouping problem or an aggregate problem? Or both? The first
query above should have the data sorted before aggregating, shouldn't it, or I
am still missing a piece of this puzzle?

darrenk

#11Noname
ocie@paracel.com
In reply to: Noname (#10)
Re: [HACKERS] No: implied sort with group by

Darren King wrote:

[examples deleted]

I see two ways to fix the above, one w/minimal code, second w/more work, but
potentially better speed for large queries.

1. Put a sort node immediately before the group node, taking into account
any user given ordering. Also make sure the optimizer is aware of this sort
when calculating query costs.

2. Instead of sorting the tuples before grouping, add a hashing system to
the group node so that the pre-sorting is not necessary.

Hmmm...is this a grouping problem or an aggregate problem? Or both? The first
query above should have the data sorted before aggregating, shouldn't it, or I
am still missing a piece of this puzzle?

darrenk

The hash should work. If the hash key is built on the group-by items,
then any row with the same entries in these columns will get hashed to
the same result row. At this point, it should be fairly easy to
perform aggregation (test and substitute for min and max, add for
sum,avg, etc).

Ocie

#12Thomas G. Lockhart
lockhart@alumni.caltech.edu
In reply to: Noname (#10)
Re: [HACKERS] No: implied sort with group by

postgres=> select b,c,sum(a) from t1 group by b,c;
b|c|sum
-+-+---
|x| 5
|z| 3
|x| 0
(3 rows)

postgres=> select * from t1;
a|b|c
-+-+-
1| |x
2| |x
2| |x
3| |z
0| |x
(5 rows)

I just inserted a single out-of-order row at the end of the table which, since the
integer value is zero, should have not affected the result. Sorry I didn't understand
the nature of the test case.

Hmmm...is this a grouping problem or an aggregate problem? Or both? The first
query above should have the data sorted before aggregating, shouldn't it, or I
am still missing a piece of this puzzle?

fwiw, I see the same incorrect behavior in v6.2.1p5.

- Tom

#13Zeugswetter Andreas DBT
Andreas.Zeugswetter@telecom.at
In reply to: Thomas G. Lockhart (#12)
Re: [HACKERS] No: implied sort with group by

Ocie wrote:

2. Instead of sorting the tuples before grouping, add a hashing

system to

the group node so that the pre-sorting is not necessary.

The hash should work. If the hash key is built on the group-by items,
then any row with the same entries in these columns will get hashed to
the same result row. At this point, it should be fairly easy to
perform aggregation (test and substitute for min and max, add for
sum,avg, etc).

Have been thinking about that too. Is each list in the current hash
implementation sorted ?
Cause else how do you know, that a certain value has not already been
processed ?
Answer: keep a list of already processed groups in memory. Initialize it
for each new hash list.

Andreas

#14Andrew Martin
martin@biochemistry.ucl.ac.uk
In reply to: Zeugswetter Andreas DBT (#13)
Re: [HACKERS] No: implied sort with group by

postgres=> select b,c,sum(a) from t1 group by b,c;
b|c|sum
-+-+---
|x| 5
|z| 3
|x| 0
(3 rows)

postgres=> select * from t1;
a|b|c
-+-+-
1| |x
2| |x
2| |x
3| |z
0| |x
(5 rows)

I just inserted a single out-of-order row at the end of the table which, since the
integer value is zero, should have not affected the result. Sorry I didn't understand
the nature of the test case.

Hmmm...is this a grouping problem or an aggregate problem? Or both? The first
query above should have the data sorted before aggregating, shouldn't it, or I
am still missing a piece of this puzzle?

fwiw, I see the same incorrect behavior in v6.2.1p5.

- Tom

And in v6.1. If b is a space (rather than a NULL), then the behaviour is correct
so it must be a problem in grouping NULLs.

Andrew

----------------------------------------------------------------------------
Dr. Andrew C.R. Martin University College London
EMAIL: (Work) martin@biochem.ucl.ac.uk (Home) andrew@stagleys.demon.co.uk
URL: http://www.biochem.ucl.ac.uk/~martin
Tel: (Work) +44(0)171 419 3890 (Home) +44(0)1372 275775

#15Noname
darrenk@insightdist.com
In reply to: Andrew Martin (#14)
Re: [HACKERS] No: implied sort with group by

postgres=> select b,c,sum(a) from t1 group by b,c;
b|c|sum
-+-+---
|x| 5
|z| 3
|x| 0
(3 rows)

postgres=> select * from t1;
a|b|c
-+-+-
1| |x
2| |x
2| |x
3| |z
0| |x
(5 rows)

...

And in v6.1. If b is a space (rather than a NULL), then the behaviour is correct
so it must be a problem in grouping NULLs.

explain select b,c,sum(a) from foo group by b,c; -- gives...

Aggregate (cost=0.00 size=0 width=0)
-> Group (cost=0.00 size=0 width=0)
-> Sort (cost=0.00 size=0 width=0)
-> Seq Scan on foo (cost=0.00 size=0 width=28)

There sort is there before the grouping operation, so this would seem to point to
the sort code incorrectly setting something when handling NULLs.

This doesn't seem like the same bug that Vadim found since a small data set such as
this one _shouldn't_ be going out to a tape file.

darrenk

#16Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Noname (#15)
Re: [HACKERS] No: implied sort with group by

And in v6.1. If b is a space (rather than a NULL), then the behaviour is correct
so it must be a problem in grouping NULLs.

explain select b,c,sum(a) from foo group by b,c; -- gives...

Aggregate (cost=0.00 size=0 width=0)
-> Group (cost=0.00 size=0 width=0)
-> Sort (cost=0.00 size=0 width=0)
-> Seq Scan on foo (cost=0.00 size=0 width=28)

There sort is there before the grouping operation, so this would seem to point to
the sort code incorrectly setting something when handling NULLs.

This doesn't seem like the same bug that Vadim found since a small data set such as
this one _shouldn't_ be going out to a tape file.

We have a NULL sort patch for psort in 6.3. Are you running the most
recent sources?

--
Bruce Momjian
maillist@candle.pha.pa.us