BUG #10256: COUNT(*) behaves sort of like RANK() when used over a window containing an ORDER BY

Started by Nonamealmost 12 years ago8 messagesbugs
Jump to latest
#1Noname
arnaud.mouronval@gmail.com

The following bug has been logged on the website:

Bug reference: 10256
Logged by: Arnaud Mouronval
Email address: arnaud.mouronval@gmail.com
PostgreSQL version: 9.3.4
Operating system: Windows 8.1
Description:

I discovered a problem while using a window that used an ORDER BY clause,
and using this window with ROW_NUMBER() and COUNT(*) at the same time.

Here is a short SQL script to replicate it :

DROP TABLE IF EXISTS tmp_count_window_bug_data;
CREATE TABLE tmp_count_window_bug_data (c1 character varying(8), c2
character varying(8));
INSERT INTO tmp_count_window_bug_data (c1, c2)
VALUES
('A', 'AA'),
('A', 'AB'),
('B', 'BA'),
('B', 'BB'),
('B', 'BC'),
('B', 'BC'),
('B', 'BD');

SELECT
c1,
COUNT(*) OVER(PARTITION BY c1),
COUNT(*) OVER(PARTITION BY c1 ORDER BY c2)
FROM tmp_count_window_bug_data;

Result on my machine :
"A";2;1
"A";2;2
"B";5;1
"B";5;2
"B";5;4
"B";5;4
"B";5;5

I was expecting to get the same values in the last 2 columns.
As you can see, the third column looks much more like a RANK() (except
RANK() would have answered 3 instead of 4 for 2 lines).

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#1)
Re: BUG #10256: COUNT(*) behaves sort of like RANK() when used over a window containing an ORDER BY

arnaud.mouronval@gmail.com writes:

I discovered a problem while using a window that used an ORDER BY clause,
and using this window with ROW_NUMBER() and COUNT(*) at the same time.

Here is a short SQL script to replicate it :

DROP TABLE IF EXISTS tmp_count_window_bug_data;
CREATE TABLE tmp_count_window_bug_data (c1 character varying(8), c2
character varying(8));
INSERT INTO tmp_count_window_bug_data (c1, c2)
VALUES
('A', 'AA'),
('A', 'AB'),
('B', 'BA'),
('B', 'BB'),
('B', 'BC'),
('B', 'BC'),
('B', 'BD');

SELECT
c1,
COUNT(*) OVER(PARTITION BY c1),
COUNT(*) OVER(PARTITION BY c1 ORDER BY c2)
FROM tmp_count_window_bug_data;

Result on my machine :
"A";2;1
"A";2;2
"B";5;1
"B";5;2
"B";5;4
"B";5;4
"B";5;5

This looks correct to me. Keep in mind that the default window framing
clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which really
means "all rows up to and including the last sort peer of the current row".
With no ORDER BY clause, all rows in the partition are peers, so you get a
count covering the whole partition no matter which row you are on. With
an ORDER BY, you get a count of only the rows up to the current row (or
its last peer, when there are duplicate values in the ORDER BY columns).

I was expecting to get the same values in the last 2 columns.

For that you'd want RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
FOLLOWING.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#3David G. Johnston
david.g.johnston@gmail.com
In reply to: Tom Lane (#2)
Re: BUG #10256: COUNT(*) behaves sort of like RANK() when used over a window containing an ORDER BY

Tom Lane-2 wrote

arnaud.mouronval@

writes:

I discovered a problem while using a window that used an ORDER BY clause,
and using this window with ROW_NUMBER() and COUNT(*) at the same time.

This looks correct to me. Keep in mind that the default window framing
clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which really
means "all rows up to and including the last sort peer of the current
row".
With no ORDER BY clause, all rows in the partition are peers, so you get a
count covering the whole partition no matter which row you are on. With
an ORDER BY, you get a count of only the rows up to the current row (or
its last peer, when there are duplicate values in the ORDER BY columns).

I was expecting to get the same values in the last 2 columns.

For that you'd want RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
FOLLOWING.

Tom's quote is pretty much an abbreviation of the following taken directly
from:
http://www.postgresql.org/docs/9.3/interactive/sql-select.html

"The default framing option is RANGE UNBOUNDED PRECEDING, which is the same
as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With ORDER BY, this
sets the frame to be all rows from the partition start up through the
current row's last peer. Without ORDER BY, all rows of the partition are
included in the window frame, since all rows become peers of the current
row."

A more useful description and example is found in the tutorial (about
halfway down, the sum(*) example):

http://www.postgresql.org/docs/9.3/interactive/tutorial-window.html

The term "peer" in the first quotation is confusing to me. My understanding
is that "PARTITION BY" defines which rows are "peers" - even if that isn't
the wording used. So now using "peers" in relation to the FRAME clause
confuses the issue. IMO the "since all rows..." is really superfluous - I
would understand:

"In the presence of an ORDER BY only rows up to the current row (including
all [partition] peers) are considered. Omitting an ORDER BY causes the
entire PARTITION to be considered the FRAME. Other frame definitions can be
requested by supplying an explicit FRAME clause."

I knew the answer but in the interest of looking through the user's eyes I
wanted to go take a look (again, done this before) and really am not
surprised that these kinds of questions are being asked. Just having to
scan through 5 sections of the documentation is difficult. That said, the
tutorial section and inline examples make it quite clear how the
with/without ORDER BY behavior of window functions (sum in the example)
cause different results.

At minimum the top of 9.3.4 could provide links to, and
descriptions/summaries of, what the other 4 sections cover and why things
are broken out the way they are. The other cross-references could point
back to that section-subsection as a kind of launch point: "Please see
section 3.5.1 for an overview of, and links to, other related sections."

Just some food for thought if anyone is industrious and annoyed enough to
act on it.

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/BUG-10256-COUNT-behaves-sort-of-like-RANK-when-used-over-a-window-containing-an-ORDER-BY-tp5803073p5803103.html
Sent from the PostgreSQL - bugs mailing list archive at Nabble.com.

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: David G. Johnston (#3)
Re: Re: BUG #10256: COUNT(*) behaves sort of like RANK() when used over a window containing an ORDER BY

David G Johnston <david.g.johnston@gmail.com> writes:

The term "peer" in the first quotation is confusing to me. My understanding
is that "PARTITION BY" defines which rows are "peers" - even if that isn't
the wording used. So now using "peers" in relation to the FRAME clause
confuses the issue.

AFAIK we've only used "peers" in this context to mean "rows with equal
sort-column values". I don't think we have a specific term for "rows
appearing in the same partition", but certainly neither the docs nor the
code mean that when they say "peer".

[ looks at SQL standard... ] The standard uses "peer" in this way too,
so that's where we got the term from. Because of that, I'm unwilling
to adopt your suggestion of thinking that "peer" means "member of the
same partition". However, it seems like maybe we need to clarify the
term some more, eg define what we mean by it in more places. Are there
any specific places that you think this should be done?

At minimum the top of 9.3.4 could provide links to, and
descriptions/summaries of, what the other 4 sections cover and why things
are broken out the way they are. The other cross-references could point
back to that section-subsection as a kind of launch point: "Please see
section 3.5.1 for an overview of, and links to, other related sections."

No particular objection to doing something like that.

Just some food for thought if anyone is industrious and annoyed enough to
act on it.

Not me, at least not in the near future.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#5David G. Johnston
david.g.johnston@gmail.com
In reply to: Tom Lane (#4)
Re: Re: BUG #10256: COUNT(*) behaves sort of like RANK() when used over a window containing an ORDER BY

On Wed, May 7, 2014 at 8:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

David G Johnston <david.g.johnston@gmail.com> writes:

The term "peer" in the first quotation is confusing to me. My

understanding

is that "PARTITION BY" defines which rows are "peers" - even if that

isn't

the wording used. So now using "peers" in relation to the FRAME clause
confuses the issue.

AFAIK we've only used "peers" in this context to mean "rows with equal
sort-column values". I don't think we have a specific term for "rows
appearing in the same partition", but certainly neither the docs nor the
code mean that when they say "peer".

[ looks at SQL standard... ] The standard uses "peer" in this way too,
so that's where we got the term from. Because of that, I'm unwilling
to adopt your suggestion of thinking that "peer" means "member of the
same partition". However, it seems like maybe we need to clarify the
term some more, eg define what we mean by it in more places. Are there
any specific places that you think this should be done?

I'm pretty sure I simply considered table rows falling into the same
partition as "peers" without anything actually saying that - and for lack
of a better term. So when I read the quoted paragraph in took some mental
processing to put it in context.

I guess rows falling into the same partition could be deemed "member" rows;
as in having membership in the partition. Then two members are of equal
stature (rank) UNLESS other attributes, defined in the ORDER BY, cause them
to be ranked, in which case members having equal ranking according to the
ORDER BY (i.e. identical values for all specified columns) are considered
peers. The default FRAME clause then only considers members of equal and
lesser rank relative to the current member.​ You can provide an explicit
FRAME clause to look both ahead and behind the current member's ranking
group.

Would need to see where to stick something like that but at least we would
have two terms: member and peer matching with PARTITION and ORDER
respectively.

Does the standard provide a word for tuples that get placed into the same
partition?

David J.

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: David G. Johnston (#5)
Re: Re: BUG #10256: COUNT(*) behaves sort of like RANK() when used over a window containing an ORDER BY

David Johnston <david.g.johnston@gmail.com> writes:

On Wed, May 7, 2014 at 8:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

[ looks at SQL standard... ] The standard uses "peer" in this way too,
so that's where we got the term from. Because of that, I'm unwilling
to adopt your suggestion of thinking that "peer" means "member of the
same partition".

I guess rows falling into the same partition could be deemed "member" rows;
as in having membership in the partition.

Works for me.

Does the standard provide a word for tuples that get placed into the same
partition?

Not that I noticed, but I didn't search hard.

The index of SQL:2011 has one entry for "peer", pointing to this
definition under 10.10 <sort specification list>:

i) Two rows that are not distinct with respect to the <sort
specification>s are said to be peers of each other. The relative
ordering of peers is implementation-dependent.

so in their usage it's not even specific to windows. The terminology
for windows seems to be mostly defined in 4.15.14, and I don't see
a term in there for the rows belonging to a partition.

regards, tom lane

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

#7Bruce Momjian
bruce@momjian.us
In reply to: David G. Johnston (#3)
Re: Re: BUG #10256: COUNT(*) behaves sort of like RANK() when used over a window containing an ORDER BY

On Wed, May 7, 2014 at 04:08:23PM -0700, David G Johnston wrote:

Tom's quote is pretty much an abbreviation of the following taken directly
from:
http://www.postgresql.org/docs/9.3/interactive/sql-select.html

"The default framing option is RANGE UNBOUNDED PRECEDING, which is the same
as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With ORDER BY, this
sets the frame to be all rows from the partition start up through the
current row's last peer. Without ORDER BY, all rows of the partition are
included in the window frame, since all rows become peers of the current
row."

A more useful description and example is found in the tutorial (about
halfway down, the sum(*) example):

http://www.postgresql.org/docs/9.3/interactive/tutorial-window.html

The term "peer" in the first quotation is confusing to me. My understanding
is that "PARTITION BY" defines which rows are "peers" - even if that isn't
the wording used. So now using "peers" in relation to the FRAME clause
confuses the issue. IMO the "since all rows..." is really superfluous - I
would understand:

"In the presence of an ORDER BY only rows up to the current row (including
all [partition] peers) are considered. Omitting an ORDER BY causes the
entire PARTITION to be considered the FRAME. Other frame definitions can be
requested by supplying an explicit FRAME clause."

I knew the answer but in the interest of looking through the user's eyes I
wanted to go take a look (again, done this before) and really am not
surprised that these kinds of questions are being asked. Just having to
scan through 5 sections of the documentation is difficult. That said, the
tutorial section and inline examples make it quite clear how the
with/without ORDER BY behavior of window functions (sum in the example)
cause different results.

At minimum the top of 9.3.4 could provide links to, and
descriptions/summaries of, what the other 4 sections cover and why things
are broken out the way they are. The other cross-references could point
back to that section-subsection as a kind of launch point: "Please see
section 3.5.1 for an overview of, and links to, other related sections."

Based on this discussion, I developed the attached patch which qualifies
"peer" in every place it is mentioned --- this should reduce future
confusion.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

Attachments:

peer.difftext/x-diff; charset=us-asciiDownload+14-14
#8Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#7)
Re: Re: BUG #10256: COUNT(*) behaves sort of like RANK() when used over a window containing an ORDER BY

On Wed, Sep 3, 2014 at 06:14:57PM -0400, Bruce Momjian wrote:

At minimum the top of 9.3.4 could provide links to, and
descriptions/summaries of, what the other 4 sections cover and why things
are broken out the way they are. The other cross-references could point
back to that section-subsection as a kind of launch point: "Please see
section 3.5.1 for an overview of, and links to, other related sections."

Based on this discussion, I developed the attached patch which qualifies
"peer" in every place it is mentioned --- this should reduce future
confusion.

Patch applied and backpatched to 9.4.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs