Window functions patch v04 for the September commit fest

Started by Hitoshi Haradaalmost 18 years ago35 messageshackers

umi.tanuki@gmail.com

almost 18 years ago

Here's the latest window functions patch against HEAD. It seems to be
ready for the September commit fest, as added documents, WINDOW clause
feature and misc tests. I guess this would be the window functions
feature freeze for 8.4. The remaining feature will be implemented for
the later release.

This patch consists of:
- windowed aggregates
- cooperate with GROUP BY aggregates
- some optimizations with multiple windows
- ranking functions
- WINDOW clause
- windowing SQL regression tests
- sgml documents

Looking up the total road map, the dropped features are:

- sliding window (window framing)
- lead(), lag(), etc. that reach for random rows
- user defined window functions

The first and second topics are difficult to implement currently.
Because these features require random row access, it seems that
tuplestore would be able to save multiple positions to mark/restore.
This is fundamental change that is over my capability. Also, user
defined window functions seem to have much more to decide. I think I
can't put into shape the general needs of user's window functions now.
Lacking these feature, this stage looks compatible to SQLServer 2005,
while Oracle and DB2 have almost full of the specification.

Also, current implementation has only a type of plan which uses sort
operation. It should be optimized by re-position the windows and/or
using hashtable.

Oh, git hosting repository is updated as well.

Regards,

--
Hitoshi Harada

Hitoshi Harada

umi.tanuki@gmail.com

almost 18 years ago

In reply to: Hitoshi Harada (#1)

Re: Window functions patch v04 for the September commit fest

2008/8/30 Hitoshi Harada <umi.tanuki@gmail.com>:

Here's the latest window functions patch against HEAD. It seems to be
ready for the September commit fest, as added documents, WINDOW clause
feature and misc tests. I guess this would be the window functions
feature freeze for 8.4. The remaining feature will be implemented for
the later release.

This patch consists of:
- windowed aggregates
- cooperate with GROUP BY aggregates
- some optimizations with multiple windows
- ranking functions
- WINDOW clause
- windowing SQL regression tests
- sgml documents

Looking up the total road map, the dropped features are:

- sliding window (window framing)
- lead(), lag(), etc. that reach for random rows
- user defined window functions

The first and second topics are difficult to implement currently.
Because these features require random row access, it seems that
tuplestore would be able to save multiple positions to mark/restore.
This is fundamental change that is over my capability. Also, user
defined window functions seem to have much more to decide. I think I
can't put into shape the general needs of user's window functions now.
Lacking these feature, this stage looks compatible to SQLServer 2005,
while Oracle and DB2 have almost full of the specification.

Also, current implementation has only a type of plan which uses sort
operation. It should be optimized by re-position the windows and/or
using hashtable.

Oh, git hosting repository is updated as well.

README is updated.
http://umitanuki.net/pgsql/wfv04/design.html

Please add link to commit fest wiki.

Regards,

--
Hitoshi Harada

David Fetter

david@fetter.org

almost 18 years ago

In reply to: Hitoshi Harada (#2)

Re: Window functions patch v04 for the September commit fest

On Mon, Sep 01, 2008 at 12:15:19PM +0900, Hitoshi Harada wrote:

README is updated.
http://umitanuki.net/pgsql/wfv04/design.html

Please add link to commit fest wiki.

Added :)

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Simon Riggs

simon@2ndQuadrant.com

almost 18 years ago

In reply to: Hitoshi Harada (#1)

Re: Window functions patch v04 for the September commit fest

On Sat, 2008-08-30 at 02:04 +0900, Hitoshi Harada wrote:

Here's the latest window functions patch against HEAD. It seems to be
ready for the September commit fest, as added documents, WINDOW clause
feature and misc tests. I guess this would be the window functions
feature freeze for 8.4. The remaining feature will be implemented for
the later release.

This patch consists of:
- windowed aggregates
- cooperate with GROUP BY aggregates
- some optimizations with multiple windows
- ranking functions
- WINDOW clause
- windowing SQL regression tests
- sgml documents

Looking up the total road map, the dropped features are:

- sliding window (window framing)
- lead(), lag(), etc. that reach for random rows
- user defined window functions

The first and second topics are difficult to implement currently.
Because these features require random row access, it seems that
tuplestore would be able to save multiple positions to mark/restore.
This is fundamental change that is over my capability. Also, user
defined window functions seem to have much more to decide. I think I
can't put into shape the general needs of user's window functions now.
Lacking these feature, this stage looks compatible to SQLServer 2005,
while Oracle and DB2 have almost full of the specification.

If you've done all of that, then I'm impressed. Well done.

Few general comments

* The docs talk about "windowing functions", yet you talk about "window
functions" here. I think the latter is correct, but whichever we choose
we should be consistent (and hopefully matching SQL Standard).

* You don't use duplicate the examples from the docs into the tests,
which is always a good way to get conflicting reports from users. :-)

* The tests seem very light for such a huge range of new functionality.
(8 tests is hardly sufficient). I'd like to see a wide range of tests -
probably 5-10 times as many individual test statements. I would also
like to see test failures that illustrate the as-yet unimplemented
features and the warning messages that are thrown - this will help us
understand exactly what is missing also. It would also be useful to see
other common coding mistakes/misconceptions and the corresponding error
messages.

Also, current implementation has only a type of plan which uses sort
operation. It should be optimized by re-position the windows and/or
using hashtable.

I would like to see some performance test results also. It would be good
to know whether they are fast/slow etc.. It will definitely help the
case for inclusion if they are faster than alternative multi-statement
approaches to solving the basic data access problems.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

almost 18 years ago

In reply to: Hitoshi Harada (#2)

Re: Window functions patch v04 for the September commit fest

Hitoshi Harada wrote:

2008/8/30 Hitoshi Harada <umi.tanuki@gmail.com>:

Here's the latest window functions patch against HEAD. It seems to be
ready for the September commit fest, as added documents, WINDOW clause
feature and misc tests. I guess this would be the window functions
feature freeze for 8.4. The remaining feature will be implemented for
the later release.

This patch consists of:
- windowed aggregates
- cooperate with GROUP BY aggregates
- some optimizations with multiple windows
- ranking functions
- WINDOW clause
- windowing SQL regression tests
- sgml documents

Looking up the total road map, the dropped features are:

- sliding window (window framing)
- lead(), lag(), etc. that reach for random rows
- user defined window functions

Ok, I'm starting to read up on SQL2003 window functions, and to review
this patch. Here's a long brain-dump of my first thoughts on the design.

Let's list a couple of requirements first:

1. It's important that what gets committed now can be extended to handle
all of the window function stuff in SQL2003 in the future, as well as
user-defined-window-functions in the spirit of PostgreSQL extensibility.
Even if we don't implement all of it in this release.

2. Performance needs to be reasonable, in the face a large number of
input rows. These are typically used in OLAP queries that process
gazillions of rows. Some window functions need access to all rows in the
window frame or partition, so you can't reasonably expect great
performance with those if the working set is large, but those that
don't, should work with finite memory requirements, and reasonably fast.

Because of 1., let's focus on the interface for
user-defined-window-functions first. Ideally, the interface is such that
all the window functions and aggregates defined in SQL2003, and others
we might want to have in core or as pgfoundry projects, can be
implemented using it.

I think there's still some confusion in the design document about what a
frame is. The terminology section is great, it helped me a lot to get
started, so thank you for that. However, in the "Actual design in the
patch" you describe how a window function works, and you talk about a
window frame, but elsewhere in the doc you note that window frames are
not implemented yet.

As already discussed, there's two different kind of window expressions:
window aggregates, and ranking aggregates (you call them window
functions in your doc). It seems that they are different enough that we
can't bang them together. For example, you can't use a window frame
clause with ranking aggregates, and ranking aggregates need to see the
whole row, whereas window aggregates only see the expressions passed as
argument.

API for window functions (aka. ranking aggregate functions)
------------------------

Borrowing ideas from the many approaches listed in the design doc, I
propose an interface using a set-returning function. Similar to Simon's
proposal, but taking into account that a ranking function might need to
see some or all input tuples before returning anything.

For each input tuple, the window function can return any number of
tuples, including 0. After processing all input values, though, the
total number of output rows returned must match the total number of
input values fed into it. The function must eventually output one value
for each input value, in the same order.

This is a flexible design that allows for different kind of
implementations; the function can return one output tuple for each input
tuple (ROW_NUMBER(), RANK() etc.), or it can swallow all input tuples
first, and only then return all output tuples (e.g. CUME_DIST).

Here's a couple of step-by-step examples of how some functions would
work. Imagine input values 1, 3, 4, 4, 5:

RANK
----

Invocation Input Value returned Internal state (after)
1 1 1 (last value 1, count 1, rank=1)
2 3 2 (last value 2, count 1, rank=2)
3 4 3 (last value 4, count 1, rank=3)
4 4 3 (last value 4, count 2, rank=3)
5 5 5 (last value 5, count 1, rank=5)

So RANK returns one tuple after each input tuple. Just like in your
current patch, keep the last value returned, a row count, and the
current rank as state.

LAG
---

Internal state is a FIFO queue of values. Each input value is pushed to
the FIFO, and the tuple that falls out of the queue is returned.

For example, LAG(<col>,2,0):

Invocation Input row Value returned Internal state (after)
1 1 0 (1)
2 3 0 (3, 1)
3 4 1 (4, 3)
4 4 3 (4, 4)
5 5 4 (5, 4)

LEAD
----

Return nothing for first <offset> input tuples. Then return the current
input tuple for the rest of the input tuples, and after the last input
tuple, return <offset> number of default values.

For example, LEAD(<col>,2,0):

Invocation Input row Value returned Internal state (after)
1 1 <none> (offsetbegin = 1, pad=2)
2 3 <none> (offsetbegin = 0, pad=2)
3 4 4 (offsetbegin = 0, pad=2)
4 4 4 (offsetbegin = 0, pad=2)
5 5 5 (offsetbegin = 0, pad=2)
6 <none> 0 (offsetbegin = 0, pad=1)
7 <none> 0 (offsetbegin = 0, pad=0)

For functions like LEAD or CUME_DIST that don't immediately start
returning values, the executor will need a tuplestore to buffer input
rows until it gets their values from the window function.

The syntax for creating a ranking aggregate might look something like this:

CREATE RANKING AGGREGATE name (
SFUNC = sfunc,
STYPE = state_data_type,
OUTFUNC = outfunc
)

This is very similar to Simon's proposal, and to current CREATE
AGGREGATE, but tweaked a bit to fix the problems Hitoshi pointed out.

sfunc is called once for each input row. Unlike with normal aggregates,
sfunc is passed the whole input row, so that e.g RANK can compare it
against the previous row, or LEAD can buffer it.

outfunc is a set-returning function, and it is called until it returns
no more rows, after each sfunc invocation.

Window aggregates
-----------------

Like normal aggregates, window aggregates process N input values, and
return one output value. In normal GROUP BY expressions, the aggregate
function is called once each group, but in a window aggregate expression
with a window frame (aka sliding window), there is as many "groups" as
there is rows.

In fact, any normal aggregate function can also be used as a window
aggregate and vice versa. The trivial implementation is to have a buffer
to hold all values currently in the window frame, and for each input
row, invoke the aggregate function over all the rows currently in the
frame. I think we'll need support that, to allow using any built-in or
user-defined aggregate as a window aggregate.

However, we also want to provide optimized versions of the common
aggregates. Aggregates like COUNT, SUM, and AVG can easily be
implemented more efficiently, without rescanning all the tuples in the
group.

I propose an extension to the current CREATE AGGREGATE syntax to allow
for optimized versions. Currently, the syntax is like:

CREATE AGGREGATE name ( input_data_type [ , ... ] ) (
SFUNC = sfunc,
STYPE = state_data_type
[ , FINALFUNC = ffunc ]
[ , INITCOND = initial_condition ]
[ , SORTOP = sort_operator ]
)

Let's add a function to allow removing tuples from the current window frame:

CREATE AGGREGATE name ( input_data_type [ , ... ] ) (
SFUNC = sfunc,
STYPE = state_data_type
[ , FINALFUNC = ffunc ]
[ , INITCOND = initial_condition ]
[ , SORTOP = sort_operator ]
[ , SUBTRACTFUNC = subfunc,
)

subfunc is like sfunc, except that the input value is "subtracted" from
the current value. On each input row, the executor calls sfunc for all
new tuples that enter the window frame, and subfunc for all tuples that
exit it. Another difference is that finalfunc can be called many times
during the lifecycle of an aggregate invocation. For example, the
subfunc for COUNT would decrement the row count by one, and for SUM,
subfunc would subtract the removed value from the current sum. Perhaps
we should also support an "opportunistic subfunc", that could either
perform the subtraction, or return a special value indicating that it
can't be done, and we need to calculate the aggregate from scratch as if
subfunc wasn't defined. That would be useful for MIN/MAX, which can be
implemented by just keeping track of the current MIN/MAX value, and
"subtracting" any other value than the current MIN/MAX is a no-op, but
"subtracting" the current MIN/MAX value requires scanning all the rest
of the tuples from scratch.

PS. Have you looked at the "hypothetical set functions" in SQL2003?

PPS. I was just about to send this, when I noticed that LEAD/LAG are in
fact not in the SQL2003 draft I have at hand. In fact, none of the
window functions defined there, RANK, DENSE_RANK, PERCENT_RANK,
CUME_DIST and ROW_NUMBER, require random access to the tuples. Still
seems like a good idea to be have the flexibility for them.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

David Fetter

david@fetter.org

almost 18 years ago

In reply to: Heikki Linnakangas (#5)

Re: Window functions patch v04 for the September commit fest

On Mon, Sep 01, 2008 at 09:00:47PM +0300, Heikki Linnakangas wrote:

Hitoshi Harada wrote:

2008/8/30 Hitoshi Harada <umi.tanuki@gmail.com>:

Here's the latest window functions patch against HEAD. It seems to be
ready for the September commit fest, as added documents, WINDOW clause
feature and misc tests. I guess this would be the window functions
feature freeze for 8.4. The remaining feature will be implemented for
the later release.

This patch consists of:
- windowed aggregates
- cooperate with GROUP BY aggregates
- some optimizations with multiple windows
- ranking functions
- WINDOW clause
- windowing SQL regression tests
- sgml documents

Looking up the total road map, the dropped features are:

- sliding window (window framing)
- lead(), lag(), etc. that reach for random rows
- user defined window functions

Ok, I'm starting to read up on SQL2003 window functions,

Maybe it would be better to read the SQL2008 standard
<http://wiscorp.com/sql200n.zip> :)

and to review this patch. Here's a long brain-dump of my first
thoughts on the design.

Let's list a couple of requirements first:

1. It's important that what gets committed now can be extended to
handle all of the window function stuff in SQL2003 in the future,
as well as user-defined-window-functions in the spirit of
PostgreSQL extensibility. Even if we don't implement all of it in
this release.

2. Performance needs to be reasonable, in the face a large number of
input rows. These are typically used in OLAP queries that process
gazillions of rows. Some window functions need access to all rows in the
window frame or partition, so you can't reasonably expect great
performance with those if the working set is large, but those that
don't, should work with finite memory requirements, and reasonably fast.

Because of 1., let's focus on the interface for
user-defined-window-functions first. Ideally, the interface is such
that all the window functions and aggregates defined in SQL2003,
and others we might want to have in core or as pgfoundry projects,
can be implemented using it.

SQL2008 defines the following:

<rank function type> ::=
RANK
| DENSE_RANK
| PERCENT_RANK
| CUME_DIST

<ntile function> ::=
NTILE <left paren> <number of tiles> <right paren>

<lead or lag function> ::=
<lead or lag> <left paren> <lead or lag extent>
[ <comma> <offset> [ <comma> <default expression> ] ] <right paren>
[ <null treatment> ]

<lead or lag> ::=
LEAD | LAG

<null treatment> ::=
RESPECT NULLS | IGNORE NULLS

<first or last value function> ::=
<first or last value> <left paren> <value expression> <right paren> [ <null treatment> ]

<first or last value> ::=
FIRST_VALUE | LAST_VALUE

<nth value function> ::=
NTH_VALUE <left paren> <value expression> <comma> <nth row> <right paren>
[ <from first or last> ] [ <null treatment> ]

<from first or last> ::=
FROM FIRST
| FROM LAST

<in-line window specification> ::=
<window specification>

For functions like LEAD or CUME_DIST that don't immediately start
returning values, the executor will need a tuplestore to buffer input
rows until it gets their values from the window function.

For these aggregates, should there be an API that signals that a tuplestore
will be needed?

The syntax for creating a ranking aggregate might look something like this:

CREATE RANKING AGGREGATE name (
SFUNC = sfunc,
STYPE = state_data_type,
OUTFUNC = outfunc
)

This is very similar to Simon's proposal, and to current CREATE
AGGREGATE, but tweaked a bit to fix the problems Hitoshi pointed
out.

sfunc is called once for each input row. Unlike with normal
aggregates, sfunc is passed the whole input row, so that e.g RANK
can compare it against the previous row, or LEAD can buffer it.

outfunc is a set-returning function, and it is called until it
returns no more rows, after each sfunc invocation.

Window aggregates
-----------------

Like normal aggregates, window aggregates process N input values, and
return one output value. In normal GROUP BY expressions, the aggregate
function is called once each group, but in a window aggregate expression
with a window frame (aka sliding window), there is as many "groups" as
there is rows.

In fact, any normal aggregate function can also be used as a window
aggregate and vice versa. The trivial implementation is to have a buffer
to hold all values currently in the window frame, and for each input
row, invoke the aggregate function over all the rows currently in the
frame. I think we'll need support that, to allow using any built-in or
user-defined aggregate as a window aggregate.

However, we also want to provide optimized versions of the common
aggregates. Aggregates like COUNT, SUM, and AVG can easily be
implemented more efficiently, without rescanning all the tuples in the
group.

I propose an extension to the current CREATE AGGREGATE syntax to allow
for optimized versions. Currently, the syntax is like:

CREATE AGGREGATE name ( input_data_type [ , ... ] ) (
SFUNC = sfunc,
STYPE = state_data_type
[ , FINALFUNC = ffunc ]
[ , INITCOND = initial_condition ]
[ , SORTOP = sort_operator ]
)

Let's add a function to allow removing tuples from the current window frame:

CREATE AGGREGATE name ( input_data_type [ , ... ] ) (
SFUNC = sfunc,
STYPE = state_data_type
[ , FINALFUNC = ffunc ]
[ , INITCOND = initial_condition ]
[ , SORTOP = sort_operator ]
[ , SUBTRACTFUNC = subfunc,
)

subfunc is like sfunc, except that the input value is "subtracted" from
the current value. On each input row, the executor calls sfunc for all
new tuples that enter the window frame, and subfunc for all tuples that
exit it. Another difference is that finalfunc can be called many times
during the lifecycle of an aggregate invocation. For example, the
subfunc for COUNT would decrement the row count by one, and for SUM,
subfunc would subtract the removed value from the current sum. Perhaps
we should also support an "opportunistic subfunc", that could either
perform the subtraction, or return a special value indicating that it
can't be done, and we need to calculate the aggregate from scratch as if
subfunc wasn't defined. That would be useful for MIN/MAX, which can be
implemented by just keeping track of the current MIN/MAX value, and
"subtracting" any other value than the current MIN/MAX is a no-op, but
"subtracting" the current MIN/MAX value requires scanning all the rest
of the tuples from scratch.

PS. Have you looked at the "hypothetical set functions" in SQL2003?

They're kinda neat :)

PPS. I was just about to send this, when I noticed that LEAD/LAG are in
fact not in the SQL2003 draft I have at hand. In fact, none of the
window functions defined there, RANK, DENSE_RANK, PERCENT_RANK,
CUME_DIST and ROW_NUMBER, require random access to the tuples. Still
seems like a good idea to be have the flexibility for them.

They're in SQL:2008.

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Bruce Momjian

bruce@momjian.us

almost 18 years ago

In reply to: Heikki Linnakangas (#5)

Re: Window functions patch v04 for the September commit fest

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

sfunc is called once for each input row. Unlike with normal aggregates, sfunc
is passed the whole input row, so that e.g RANK can compare it against the
previous row, or LEAD can buffer it.

I'm not sure I follow this bit about being passed the whole input row. How
would that relate to something like lag(x*y, 2) for example?

outfunc is a set-returning function, and it is called until it returns no more
rows, after each sfunc invocation.

And in any case I feel like this is abstraction on the cheap. The only reason
it's so general is because it's leaving all the work to the functions to
implement.

It also means we get no benefit from cases like

SELECT lag(x,5),lag(y,5),lag(z,5)

where the executor could keep one tuplestore and for all of them. For that
matter it could keep one tuplestore even for cases like lag(x,5),lag(y,4).

What would the executor do for a query like

SELECT lead(x,1),lead(y,2),lead(y,3)

It would not only have to keep a tuplestore to buffer the output but it would
have to deal with receiving data from different SRFs at different times. The
best approach I can think of would be to keep a tuplestore for each SRF using
themas queues, reading old values from the head as soon as they all have at
least one new value in them.

And it doesn't answer how to deal with things like

SELECT lag(x,1) OVER (ORDER BY a), lag(x,1) OVER (ORDER BY b)

I, uh, don't actually have any ideas of how to deal with that one :(

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!

Hitoshi Harada

umi.tanuki@gmail.com

almost 18 years ago

In reply to: Heikki Linnakangas (#5)

Re: Window functions patch v04 for the September commit fest

Thanks for comments,

2008/9/2 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:

Ok, I'm starting to read up on SQL2003 window functions, and to review this
patch. Here's a long brain-dump of my first thoughts on the design.

Let's list a couple of requirements first:

1. It's important that what gets committed now can be extended to handle all
of the window function stuff in SQL2003 in the future, as well as
user-defined-window-functions in the spirit of PostgreSQL extensibility.
Even if we don't implement all of it in this release.

2. Performance needs to be reasonable, in the face a large number of input
rows. These are typically used in OLAP queries that process gazillions of
rows. Some window functions need access to all rows in the window frame or
partition, so you can't reasonably expect great performance with those if
the working set is large, but those that don't, should work with finite
memory requirements, and reasonably fast.

Because of 1., let's focus on the interface for
user-defined-window-functions first. Ideally, the interface is such that all
the window functions and aggregates defined in SQL2003, and others we might
want to have in core or as pgfoundry projects, can be implemented using it.

I think there's still some confusion in the design document about what a
frame is. The terminology section is great, it helped me a lot to get
started, so thank you for that. However, in the "Actual design in the patch"
you describe how a window function works, and you talk about a window frame,
but elsewhere in the doc you note that window frames are not implemented
yet.

Sorry about that. In my understanding, the "Window Frame" is defined
by clauses such like "ROWS BETWEEN ... ", "RANGE BETWEEN ... " or so,
contrast to "Window Partition" defined by "PARTITION BY" clause. A
frame slides within a partition or there's only one frame if those
clauses are not specified. The current patch has not implemented this
yet. I'll update the docs.

As already discussed, there's two different kind of window expressions:
window aggregates, and ranking aggregates (you call them window functions in
your doc). It seems that they are different enough that we can't bang them
together. For example, you can't use a window frame clause with ranking
aggregates, and ranking aggregates need to see the whole row, whereas window
aggregates only see the expressions passed as argument.

I don't like to call the second type "ranking aggregates" because it
may refer to only ranking functions though there are more types of
function like ntile(), lead() and lag(). But "window functions"
doesn't seem appropriate either since it's ambiguous with the general
name of window expressions.

API for window functions (aka. ranking aggregate functions)
------------------------

Borrowing ideas from the many approaches listed in the design doc, I propose
an interface using a set-returning function. Similar to Simon's proposal,
but taking into account that a ranking function might need to see some or
all input tuples before returning anything.

For each input tuple, the window function can return any number of tuples,
including 0. After processing all input values, though, the total number of
output rows returned must match the total number of input values fed into
it. The function must eventually output one value for each input value, in
the same order.

This is a flexible design that allows for different kind of implementations;
the function can return one output tuple for each input tuple (ROW_NUMBER(),
RANK() etc.), or it can swallow all input tuples first, and only then return
all output tuples (e.g. CUME_DIST).

Here's a couple of step-by-step examples of how some functions would work.
Imagine input values 1, 3, 4, 4, 5:

RANK
----

Invocation Input Value returned Internal state (after)
1 1 1 (last value 1, count 1, rank=1)
2 3 2 (last value 2, count 1, rank=2)
3 4 3 (last value 4, count 1, rank=3)
4 4 3 (last value 4, count 2, rank=3)
5 5 5 (last value 5, count 1, rank=5)

So RANK returns one tuple after each input tuple. Just like in your current
patch, keep the last value returned, a row count, and the current rank as
state.

LAG
---

Internal state is a FIFO queue of values. Each input value is pushed to the
FIFO, and the tuple that falls out of the queue is returned.

For example, LAG(<col>,2,0):

Invocation Input row Value returned Internal state (after)
1 1 0 (1)
2 3 0 (3, 1)
3 4 1 (4, 3)
4 4 3 (4, 4)
5 5 4 (5, 4)

LEAD
----

Return nothing for first <offset> input tuples. Then return the current
input tuple for the rest of the input tuples, and after the last input
tuple, return <offset> number of default values.

For example, LEAD(<col>,2,0):

Invocation Input row Value returned Internal state (after)
1 1 <none> (offsetbegin = 1, pad=2)
2 3 <none> (offsetbegin = 0, pad=2)
3 4 4 (offsetbegin = 0, pad=2)
4 4 4 (offsetbegin = 0, pad=2)
5 5 5 (offsetbegin = 0, pad=2)
6 <none> 0 (offsetbegin = 0, pad=1)
7 <none> 0 (offsetbegin = 0, pad=0)

For functions like LEAD or CUME_DIST that don't immediately start returning
values, the executor will need a tuplestore to buffer input rows until it
gets their values from the window function.

The syntax for creating a ranking aggregate might look something like this:

CREATE RANKING AGGREGATE name (
SFUNC = sfunc,
STYPE = state_data_type,
OUTFUNC = outfunc
)

This is very similar to Simon's proposal, and to current CREATE AGGREGATE,
but tweaked a bit to fix the problems Hitoshi pointed out.

sfunc is called once for each input row. Unlike with normal aggregates,
sfunc is passed the whole input row, so that e.g RANK can compare it against
the previous row, or LEAD can buffer it.

outfunc is a set-returning function, and it is called until it returns no
more rows, after each sfunc invocation.

Your proposal is smarter than the current implementation. But it
doesn't seem complete in some way. From logical point of view, the
window functions should be allowed to access whole of rows in the
frame the current row belongs to (c.f. inverse distribution
functions). From implementing and performance point of view, there
need to consider such case like mixing window aggregates and ranking
aggregates in the same window, which means it is better that the two
types of functions are processed in the similar flow. Also logically,
SQL spec doesn't forbid ranking aggregates in sliding window frames.
So your design may be flexible enough to cover different requirements
of lead()/lag() but I don't know if it will.

Window aggregates
-----------------

Like normal aggregates, window aggregates process N input values, and return
one output value. In normal GROUP BY expressions, the aggregate function is
called once each group, but in a window aggregate expression with a window
frame (aka sliding window), there is as many "groups" as there is rows.

In fact, any normal aggregate function can also be used as a window
aggregate and vice versa. The trivial implementation is to have a buffer to
hold all values currently in the window frame, and for each input row,
invoke the aggregate function over all the rows currently in the frame. I
think we'll need support that, to allow using any built-in or user-defined
aggregate as a window aggregate.

However, we also want to provide optimized versions of the common
aggregates. Aggregates like COUNT, SUM, and AVG can easily be implemented
more efficiently, without rescanning all the tuples in the group.

I propose an extension to the current CREATE AGGREGATE syntax to allow for
optimized versions. Currently, the syntax is like:

CREATE AGGREGATE name ( input_data_type [ , ... ] ) (
SFUNC = sfunc,
STYPE = state_data_type
[ , FINALFUNC = ffunc ]
[ , INITCOND = initial_condition ]
[ , SORTOP = sort_operator ]
)

Let's add a function to allow removing tuples from the current window frame:

CREATE AGGREGATE name ( input_data_type [ , ... ] ) (
SFUNC = sfunc,
STYPE = state_data_type
[ , FINALFUNC = ffunc ]
[ , INITCOND = initial_condition ]
[ , SORTOP = sort_operator ]
[ , SUBTRACTFUNC = subfunc,
)

subfunc is like sfunc, except that the input value is "subtracted" from the
current value. On each input row, the executor calls sfunc for all new
tuples that enter the window frame, and subfunc for all tuples that exit it.
Another difference is that finalfunc can be called many times during the
lifecycle of an aggregate invocation. For example, the subfunc for COUNT
would decrement the row count by one, and for SUM, subfunc would subtract
the removed value from the current sum. Perhaps we should also support an
"opportunistic subfunc", that could either perform the subtraction, or
return a special value indicating that it can't be done, and we need to
calculate the aggregate from scratch as if subfunc wasn't defined. That
would be useful for MIN/MAX, which can be implemented by just keeping track
of the current MIN/MAX value, and "subtracting" any other value than the
current MIN/MAX is a no-op, but "subtracting" the current MIN/MAX value
requires scanning all the rest of the tuples from scratch.

Hmmm, its name may be better as "invfunc" than subfunc. I feel
horrible in imagining to manage the combination of functions that
don't support subfunc, ones that does and ranking aggregates but it's
possible.

PS. Have you looked at the "hypothetical set functions" in SQL2003?

I had a glance but not deeply, since I found I would be lost in design
if deeply diving into it. :)

--
Hitoshi Harada

Hitoshi Harada

umi.tanuki@gmail.com

almost 18 years ago

In reply to: Bruce Momjian (#7)

Re: Window functions patch v04 for the September commit fest

2008/9/2 Gregory Stark <stark@enterprisedb.com>:

And it doesn't answer how to deal with things like

SELECT lag(x,1) OVER (ORDER BY a), lag(x,1) OVER (ORDER BY b)

I, uh, don't actually have any ideas of how to deal with that one :(

For different windows we make different window nodes with different
sort nodes as the patch does.

Regards,

--
Hitoshi Harada

#10

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

almost 18 years ago

In reply to: David Fetter (#6)

Re: Window functions patch v04 for the September commit fest

David Fetter wrote:

On Mon, Sep 01, 2008 at 09:00:47PM +0300, Heikki Linnakangas wrote:

Ok, I'm starting to read up on SQL2003 window functions,

Maybe it would be better to read the SQL2008 standard
<http://wiscorp.com/sql200n.zip> :)

Ah, thanks!

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#11

Tom Lane

tgl@sss.pgh.pa.us

almost 18 years ago

In reply to: Heikki Linnakangas (#10)

Re: Window functions patch v04 for the September commit fest

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

David Fetter wrote:

On Mon, Sep 01, 2008 at 09:00:47PM +0300, Heikki Linnakangas wrote:

Ok, I'm starting to read up on SQL2003 window functions,

Maybe it would be better to read the SQL2008 standard
<http://wiscorp.com/sql200n.zip> :)

Ah, thanks!

Er, s/2008 standard/200n draft with uncertain chances of passage/

It's not like we haven't seen a SQL draft go down in flames before.

regards, tom lane

#12

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

almost 18 years ago

In reply to: Bruce Momjian (#7)

Re: Window functions patch v04 for the September commit fest

Gregory Stark wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

sfunc is called once for each input row. Unlike with normal aggregates, sfunc
is passed the whole input row, so that e.g RANK can compare it against the
previous row, or LEAD can buffer it.

I'm not sure I follow this bit about being passed the whole input row. How
would that relate to something like lag(x*y, 2) for example?

Well, sfunc needs to be passed not only the whole input tuple, but also
the values of the arguments.

Hmm. The spec says that the offset argument of lead and lag needs to be
a numeric literal. To enforce that at parse time, we'd have to hard-code
that into the grammar. Or we could check it at run-time, and throw an
error if its value changes across invocations of the sfunc function.

outfunc is a set-returning function, and it is called until it returns no more
rows, after each sfunc invocation.

And in any case I feel like this is abstraction on the cheap. The only reason
it's so general is because it's leaving all the work to the functions to
implement.

It also means we get no benefit from cases like

SELECT lag(x,5),lag(y,5),lag(z,5)

where the executor could keep one tuplestore and for all of them. For that
matter it could keep one tuplestore even for cases like lag(x,5),lag(y,4).

True. I guess it comes down to how much complexity we're willing to have
in the executor node to handle that more efficiently.

Note that the memory usage is the same either way, except for the extra
constant overhead of multiple tuplestores, because each tuplestore
inside the lag implementation only needs to store its own column.

What would the executor do for a query like

SELECT lead(x,1),lead(y,2),lead(y,3)

It would not only have to keep a tuplestore to buffer the output but it would
have to deal with receiving data from different SRFs at different times. The
best approach I can think of would be to keep a tuplestore for each SRF using
themas queues, reading old values from the head as soon as they all have at
least one new value in them.

Hitoshi solved that creating a separate Window node for each window
function. So the plan tree for that would look something like:

Window (lead(x,1))
Window (lead(y,2))
Window (lead(y,3))
Seq Scan ...

That keeps the Window node implementation quite simple because it only
needs to handle one window function at a time.

And it doesn't answer how to deal with things like

SELECT lag(x,1) OVER (ORDER BY a), lag(x,1) OVER (ORDER BY b)

I, uh, don't actually have any ideas of how to deal with that one :(

The above handles that by putting extra sort nodes in between the Window
nodes. Not too efficient, but I don't see any way around it.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#13

David Fetter

david@fetter.org

almost 18 years ago

In reply to: Tom Lane (#11)

Re: Window functions patch v04 for the September commit fest

On Tue, Sep 02, 2008 at 02:42:25AM -0400, Tom Lane wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

David Fetter wrote:

On Mon, Sep 01, 2008 at 09:00:47PM +0300, Heikki Linnakangas wrote:

Ok, I'm starting to read up on SQL2003 window functions,

Maybe it would be better to read the SQL2008 standard
<http://wiscorp.com/sql200n.zip> :)

Ah, thanks!

Er, s/2008 standard/200n draft with uncertain chances of passage/

It's not like we haven't seen a SQL draft go down in flames before.

Do you think that anything in the windowing functions section will
disappear?

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#14

Tom Lane

tgl@sss.pgh.pa.us

almost 18 years ago

In reply to: David Fetter (#13)

Re: Window functions patch v04 for the September commit fest

David Fetter <david@fetter.org> writes:

On Tue, Sep 02, 2008 at 02:42:25AM -0400, Tom Lane wrote:

It's not like we haven't seen a SQL draft go down in flames before.

Do you think that anything in the windowing functions section will
disappear?

Who's to say?

I have no objection to looking at the 2003 and 200n documents in
parallel, especially if there are places where 200n clarifies the
intent of 2003. But I'd be suspicious of designing around
entirely-new features presented by 200n.

regards, tom lane

#15

Stefan Kaltenbrunner

stefan@kaltenbrunner.cc

almost 18 years ago

In reply to: Tom Lane (#14)

Re: Window functions patch v04 for the September commit fest

Tom Lane wrote:

David Fetter <david@fetter.org> writes:

On Tue, Sep 02, 2008 at 02:42:25AM -0400, Tom Lane wrote:

It's not like we haven't seen a SQL draft go down in flames before.

Do you think that anything in the windowing functions section will
disappear?

Who's to say?

I have no objection to looking at the 2003 and 200n documents in
parallel, especially if there are places where 200n clarifies the
intent of 2003. But I'd be suspicious of designing around
entirely-new features presented by 200n.

well http://www.wiscorp.com/SQLStandards.html

states:

"This points to the documents which wlll likely be the documents that
represent the SQL 2008 Standard. These documents are out for
International Standard ballot at this time. The vote is an Up/Down vote.
No changes allowed."

Stefan

#16

Hitoshi Harada

umi.tanuki@gmail.com

almost 18 years ago

In reply to: Heikki Linnakangas (#12)

Re: Window functions patch v04 for the September commit fest

2008/9/2 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:

Gregory Stark wrote:

What would the executor do for a query like

SELECT lead(x,1),lead(y,2),lead(y,3)

It would not only have to keep a tuplestore to buffer the output but it
would
have to deal with receiving data from different SRFs at different times.
The
best approach I can think of would be to keep a tuplestore for each SRF
using
themas queues, reading old values from the head as soon as they all have
at
least one new value in them.

Hitoshi solved that creating a separate Window node for each window
function. So the plan tree for that would look something like:

Window (lead(x,1))
Window (lead(y,2))
Window (lead(y,3))
Seq Scan ...

That keeps the Window node implementation quite simple because it only needs
to handle one window function at a time.

To say more accurately, one Window node can handle more than one
window function. If it is thought to be the same using equalFuncs
comparing targetlists, some functions are put into one node.

Regards,

--
Hitoshi Harada

#17

Simon Riggs

simon@2ndQuadrant.com

almost 18 years ago

In reply to: Heikki Linnakangas (#5)

Re: Window functions patch v04 for the September commit fest

On Mon, 2008-09-01 at 21:00 +0300, Heikki Linnakangas wrote:

1. It's important that what gets committed now can be extended to handle
all of the window function stuff in SQL2003 in the future, as well as
user-defined-window-functions in the spirit of PostgreSQL extensibility.
Even if we don't implement all of it in this release.

I think whatever public APIs get published now must be sufficient to
support user-defined-window functions across all future releases, so on
that point I agree completely. (One reason why I argued earlier in
favour of avoiding an API for now).

We shouldn't restrict the implementation of the internals to be upward
compatible though because I foresee some aspect of complexity stalling
and thus killing the patch in the short term if we do that. We don't
have much time left for this release.

If we only have the combined (brain * time) to get a partial
implementation in for this release then I would urge we go for that,
rather than wait for perfection - as long as there are no other negative
effects.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#18

Martijn van Oosterhout

kleptog@svana.org

almost 18 years ago

In reply to: Simon Riggs (#17)

Re: Window functions patch v04 for the September commit fest

On Tue, Sep 02, 2008 at 10:44:31AM +0100, Simon Riggs wrote:

If we only have the combined (brain * time) to get a partial
implementation in for this release then I would urge we go for that,
rather than wait for perfection - as long as there are no other negative
effects.

"premature optimization is the root of all evil." (Knuth, Donald.)

"make it work, then make it better".

Getting a partial implementation that works out now is far better than
waiting until its perfect.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

Please line up in a tree and maintain the heap invariant while
boarding. Thank you for flying nlogn airlines.

#19

Simon Riggs

simon@2ndQuadrant.com

almost 18 years ago

In reply to: Tom Lane (#14)

Re: Window functions patch v04 for the September commit fest

On Tue, 2008-09-02 at 03:14 -0400, Tom Lane wrote:

David Fetter <david@fetter.org> writes:

On Tue, Sep 02, 2008 at 02:42:25AM -0400, Tom Lane wrote:

It's not like we haven't seen a SQL draft go down in flames before.

Do you think that anything in the windowing functions section will
disappear?

Who's to say?

I have no objection to looking at the 2003 and 200n documents in
parallel, especially if there are places where 200n clarifies the
intent of 2003. But I'd be suspicious of designing around
entirely-new features presented by 200n.

I have confirmation from Michael Gorman, Wiscorp, that

The new standard was approved in early Summer. SQL 2008 is finished.

So as of now, SQL2008 exists, all hail. SQL2003 and earlier versions
have been superseded and can be ignored.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#20

Hitoshi Harada

umi.tanuki@gmail.com

almost 18 years ago

In reply to: Simon Riggs (#4)

Re: Window functions patch v04 for the September commit fest

2008/9/2 Simon Riggs <simon@2ndquadrant.com>:

If you've done all of that, then I'm impressed. Well done.

Few general comments

* The docs talk about "windowing functions", yet you talk about "window
functions" here. I think the latter is correct, but whichever we choose
we should be consistent (and hopefully matching SQL Standard).

That's what I'm embarrassed. Now we have "window functions" meaning
two, one for generic name of window expressions and the other for
non-window-aggregates. It is a word play, which is difficult problem
for non-native people, but... let's use "window functions". I'll
revise sgml docs.

* You don't use duplicate the examples from the docs into the tests,
which is always a good way to get conflicting reports from users. :-)

* The tests seem very light for such a huge range of new functionality.
(8 tests is hardly sufficient). I'd like to see a wide range of tests -
probably 5-10 times as many individual test statements. I would also
like to see test failures that illustrate the as-yet unimplemented
features and the warning messages that are thrown - this will help us
understand exactly what is missing also. It would also be useful to see
other common coding mistakes/misconceptions and the corresponding error
messages.

Also, current implementation has only a type of plan which uses sort
operation. It should be optimized by re-position the windows and/or
using hashtable.

I would like to see some performance test results also. It would be good
to know whether they are fast/slow etc.. It will definitely help the
case for inclusion if they are faster than alternative multi-statement
approaches to solving the basic data access problems.