two-argument aggregates and SQL 2003
Hello All,
I just thought about implementing some two-argument aggregate functions from
SQL 2003 (like CORR(x,y), REGR_SLOPE(x,y) etc...)
( http://www.wiscorp.com/SQL2003Features.pdf , page 10)
1) I looked into the architecture of how aggregate functions are created
and used, and it seems to me that the structure of the pg_aggregate and
pg_proc tables does not prevent the creation of two-argument
aggregate functions. For each two-arg. aggregate, the corresponding
three-arg. transition function and the two-arg. aggregate_dummy function
should be added to pg_proc, and a record should be added to
pg_aggregate. Nothing else, and nothing internal, needs to be changed to
insert new two-arg. aggregate functions into the core.
Am I right in this ?
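A minimal sketch of the scheme in (1), written in Python rather than the backend's C, and only meant to show the shape of the idea: a two-argument aggregate is driven by a three-argument transition function (state plus the two inputs). The names regr_slope_trans, regr_slope_final and regr_slope are hypothetical, and the separate final function is an assumption modelled on how existing one-argument aggregates work; nothing here is actual PostgreSQL code.

```python
from functools import reduce

# Hypothetical three-argument transition function: (state, y, x) -> new state.
# The state holds running sums: (n, sum_x, sum_y, sum_xx, sum_xy).
def regr_slope_trans(state, y, x):
    if x is None or y is None:
        return state  # skip rows where either input is NULL
    n, sx, sy, sxx, sxy = state
    return (n + 1, sx + x, sy + y, sxx + x * x, sxy + x * y)

# Hypothetical final function: turns the accumulated state into the result.
def regr_slope_final(state):
    n, sx, sy, sxx, sxy = state
    denom = n * sxx - sx * sx
    if n == 0 or denom == 0:
        return None  # slope is undefined; stands in for SQL NULL
    return (n * sxy - sx * sy) / denom

def regr_slope(rows):
    init = (0, 0.0, 0.0, 0.0, 0.0)
    state = reduce(lambda s, row: regr_slope_trans(s, *row), rows, init)
    return regr_slope_final(state)

# (y, x) pairs lying on y = 2x + 1, plus one row with a NULL input.
rows = [(1.0, 0.0), (3.0, 1.0), (5.0, 2.0), (None, 3.0)]
slope = regr_slope(rows)
print(slope)  # 2.0
```

The per-row work is confined to the transition function, so the only structural difference from a one-argument aggregate is one extra parameter.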
2) I also thought about allowing users to create new two-arg.
aggregates. For that, I saw only one thing that could/should be changed:
the handling of the BASETYPE attribute of the CREATE AGGREGATE
command.
CREATE AGGREGATE name (
    BASETYPE = input_data_type,
    SFUNC = sfunc,
    STYPE = state_data_type ... )
I am not very familiar with the parser/lexer details in postgres, but would it
be possible to allow something like this:
CREATE AGGREGATE new_2arg_agg ( BASETYPE = (int,int) , .... )
to create two-arg. aggregates?
I'd like to hear any comments/advice/objections...
Regards,
Sergey
*****************************************************
Sergey E. Koposov
Max Planck Institute for Astronomy/Sternberg Astronomical Institute
Web: http://lnfm1.sai.msu.ru/~math
E-mail: math@sai.msu.ru
"Sergey E. Koposov" <math@sai.msu.ru> writes:
... Nothing else, and nothing internal, needs to be changed to
insert new two-arg. aggregate functions into the core.
Am I right in this ?
IIRC the main issues are the syntax of CREATE AGGREGATE and the actual
implementation in nodeAgg.c. See previous discussions, eg
http://archives.postgresql.org/pgsql-general/2006-03/msg00512.php
I would really prefer to see CREATE AGGREGATE normalized to have a
syntax comparable to CREATE FUNCTION (or DROP AGGREGATE for that
matter):
CREATE AGGREGATE aggname (typname [, ... ]) ...definition...
but it's not clear how to get there without breaking backwards
compatibility :-(
regards, tom lane
On Thu, 13 Apr 2006, Tom Lane wrote:
"Sergey E. Koposov" <math@sai.msu.ru> writes:
... Nothing else, and nothing internal, needs to be changed to
insert new two-arg. aggregate functions into the core.
Am I right in this ?
IIRC the main issues are the syntax of CREATE AGGREGATE and the actual
implementation in nodeAgg.c. See previous discussions, eg
http://archives.postgresql.org/pgsql-general/2006-03/msg00512.php
Actually, I think I'll try to implement that.
I have already spent some time looking at the things that would need to
change, and I have a question: does it make sense to extend aggregate
functions only to the two-argument case? I mean, would that have a chance of
being accepted?
It seems it would be much simpler for me to implement one- or
two-arg. aggregates (not aggregates with ANY number of args), since that does
not require variable-length arrays and the additional burden of memory
allocations, contexts, etc.
I would really prefer to see CREATE AGGREGATE normalized to have a
syntax comparable to CREATE FUNCTION (or DROP AGGREGATE for that
matter):
CREATE AGGREGATE aggname (typname [, ... ]) ...definition...
but it's not clear how to get there without breaking backwards
compatibility :-(
I don't know what to do about the CREATE AGGREGATE syntax. I don't think I
will work on that, since my first goal is to enable core (not user-created)
two-arg. aggregates. I hope that's acceptable ...
Regards,
Sergey
*******************************************************************
Sergey E. Koposov
Max Planck Institute for Astronomy/Sternberg Astronomical Institute
Web: http://lnfm1.sai.msu.ru/~math
E-mail: math@sai.msu.ru
"Sergey E. Koposov" <math@sai.msu.ru> writes:
Does it make sense to extend aggregate
functions only to the two-argument case?
No, I don't think so, for two reasons:
1. The user's-eye view: if someone wants 2 arguments, tomorrow he'll
want 3, etc. There's an old saying that "the only good numbers in
programming language design are zero, one, and N" --- if you allow more
than one of anything, there shouldn't be an upper limit on how many you
allow. In practice there are many places in PG where we break that rule
to the extent of having a configurable upper limit (eg MAX_INDEX_KEYS)
... but small limits hard-wired into the code are just not pleasant.
2. The implementor's view: hard-wired limits are usually not that nice
from a coding standpoint either. Polya's Inventor's Paradox states that
"the more general problem may be easier to solve", and I've found that
usually holds up in program design too. Code that handles exactly 2 of
something is generally uglier and less maintainable than code that
handles N of something, because for example you are tempted to duplicate
chunks of code instead of turning them into loops.
regards, tom lane
Tom Lane wrote:
I would really prefer to see CREATE AGGREGATE normalized to have a
syntax comparable to CREATE FUNCTION (or DROP AGGREGATE for that
matter):
CREATE AGGREGATE aggname (typname [, ... ]) ...definition...
but it's not clear how to get there without breaking backwards
compatibility :-(
To modify the CREATE FUNCTION syntax into a new CREATE AGGREGATE syntax,
we would modify a few things, I think:
CREATE [ OR REPLACE ] FUNCTION
    name ( [ [ argmode ] [ argname ] argtype [, ...] ] )
    [ RETURNS rettype ]
  { LANGUAGE langname
    | IMMUTABLE | STABLE | VOLATILE
    | CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT | STRICT
    | [ EXTERNAL ] SECURITY INVOKER | [ EXTERNAL ] SECURITY DEFINER
    | AS 'definition'
    | AS 'obj_file', 'link_symbol'
  } ...
    [ WITH ( attribute [, ...] ) ]
1) Drop [ argmode ], because no OUT or INOUT parameters are possible.
2) Change the implicit meaning of the [ rettype ] parameter to not allow SETOF.
(I'd love to have aggregate functions that take arbitrary numbers of rows as
input and return arbitrary numbers of rows as output, but I'm guessing the
internals of the backend would require much work to handle that?)
3) Add a state_data_type
4) Add an optional initial_condition
5) Add an optional sort_operator
6) Add some handling of final_function-like behavior, which I have not handled
below. Should it be done as in the current CREATE AGGREGATE syntax, where you
must reference another function, or can anybody see a clean way to let this one
function do it all in one shot?
This might give us, excluding any final_function syntax:
CREATE [ OR REPLACE ] AGGREGATE
    name ( [ [ argname ] argtype [, ...] ] )
    STYPE state_data_type
    [ INITCOND initial_condition ]
    [ SORTOP sort_operator ]
    [ RETURNS rettype ]
  { LANGUAGE langname
    | IMMUTABLE | STABLE | VOLATILE
    | CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT | STRICT
    | [ EXTERNAL ] SECURITY INVOKER | [ EXTERNAL ] SECURITY DEFINER
    | AS 'definition'
    | AS 'obj_file', 'link_symbol'
  } ...
    [ WITH ( attribute [, ...] ) ]
It seems that this syntax is distinct from the current syntax and that the
parser could support both. Thoughts?
I wrote [ in an off-list reply to Mark Dilger ]:
I don't think this solves the parsing problem at all. The problem as I
see it is that given
CREATE AGGREGATE foo (bar ...
it's not obvious whether bar is a def_elem name (old syntax) or a type
name (new syntax). It's possible that we can get bison to eat both
anyway on the basis that the lookahead token must be '=' at this point
for old syntax while it could not be '=' for new syntax.
I did some idle investigation of this and found that it's indeed
possible, as long as we make the further restriction that none of the
definition-list keywords used by old-style CREATE AGGREGATE be keywords
of the SQL grammar. (We could probably allow selected ones if we had
to, but using ColLabel rather than IDENT in the patch below leads to
tons of reduce/reduce conflicts...) Attached is a proof-of-concept
patch, which doesn't do anything useful as-is because none of the rest
of the backend has been updated, but it does prove that bison can be
made to handle CREATE AGGREGATE syntax with an initial list of type
names. For instance the first example in
http://www.postgresql.org/docs/8.1/static/xaggr.html
would become
CREATE AGGREGATE complex_sum (complex)
(
    sfunc = complex_add,
    stype = complex,
    initcond = '(0,0)'
);
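The one-token lookahead rule discussed above (after CREATE AGGREGATE foo (bar, an '=' means old syntax, anything else means new syntax) can be illustrated outside bison. The following is only a sketch in Python with a toy tokenizer; tokenize and create_aggregate_style are hypothetical names, it is not the proof-of-concept grammar patch, and it ignores everything a real grammar must handle (qualified names, keywords, comments).

```python
import re

# Toy tokenizer: identifiers, a few punctuation characters, quoted strings.
TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z_0-9]*|[(),=;]|'[^']*')")

def tokenize(sql):
    sql = sql.strip()
    pos, out = 0, []
    while pos < len(sql):
        m = TOKEN.match(sql, pos)
        if m is None:
            raise ValueError("unexpected input at offset %d" % pos)
        out.append(m.group(1))
        pos = m.end()
    return out

def create_aggregate_style(sql):
    """Return 'old' or 'new' based on the token after the first name in parens."""
    toks = tokenize(sql)
    # Expected prefix: CREATE AGGREGATE aggname ( something lookahead ...
    if (len(toks) < 6
            or [t.upper() for t in toks[:2]] != ["CREATE", "AGGREGATE"]
            or toks[3] != "("):
        raise ValueError("not a CREATE AGGREGATE statement")
    # toks[4] is either a def_elem name (old) or a type name (new);
    # the token after it decides which.
    return "old" if toks[5] == "=" else "new"

old_sql = ("CREATE AGGREGATE complex_sum (basetype = complex, "
           "sfunc = complex_add, stype = complex, initcond = '(0,0)');")
new_sql = ("CREATE AGGREGATE complex_sum (complex) "
           "(sfunc = complex_add, stype = complex, initcond = '(0,0)');")

print(create_aggregate_style(old_sql))  # old
print(create_aggregate_style(new_sql))  # new
```

A hand-written check like this can inspect the lookahead freely; the restriction on keywords mentioned above comes from what bison's grammar can resolve without conflicts, which this sketch does not model.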
I'm inclined to flesh this out and apply it with or without any further
work by Sergey, simply because it makes the syntax of CREATE AGGREGATE
more in harmony with DROP AGGREGATE and the other AGGREGATE commands.
Any objections out there?
Another thing we could look into is doing something similar to CREATE
OPERATOR, so that it names the new operator the same way you would do
in DROP OPERATOR. Not sure if this is worth the trouble or not, as
I don't find DROP OPERATOR amazingly intuitive.
regards, tom lane
I wrote:
... Polya's Inventor's Paradox states that
"the more general problem may be easier to solve", and I've found that
usually holds up in program design too.
While fooling around with the grammar patch that I showed earlier today,
I had an epiphany that might serve as illustration of the above. We
have traditionally thought of COUNT(*) as an "aggregate over any base
type". But wouldn't it be cleaner to think of it as an aggregate over
zero inputs? That would get rid of the rather artificial need to
convert COUNT(*) to COUNT(1). We would actually have two separate
aggregate functions, which could most accurately be described as
count()
count(anyelement)
where the latter is the form that has the behavior of counting the
non-null values of the input.
While this doesn't really simplify nodeAgg.c, it wouldn't add any
complexity either (once the code has been recast to support variable
numbers of arguments). And it seems to me that it clarifies the
semantics noticeably --- in particular, there'd no longer be this weird
special case that an aggregate over ANY should have a one-input
transition function where everything else takes two-input. The rule
would be simple: an N-input aggregate uses an N-plus-one-input
transition function.
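As a rough illustration of that rule (a Python sketch, not nodeAgg.c; run_aggregate and the three transition functions are hypothetical names), one generic driver can serve zero-, one- and two-input aggregates when each transition function takes the state plus N inputs:

```python
def run_aggregate(transfn, init, rows):
    """Generic driver: each row is a tuple of N inputs; transfn takes N+1 args."""
    state = init
    for args in rows:
        state = transfn(state, *args)
    return state

# Zero-input aggregate, i.e. count(): one-input transition function.
def count_star_trans(state):
    return state + 1

# One-input aggregate, i.e. count(anyelement): two-input transition function
# that counts only non-null values.
def count_trans(state, value):
    return state if value is None else state + 1

# Two-input aggregate (sum of products, as an example): three-input
# transition function.
def sum_xy_trans(state, x, y):
    if x is None or y is None:
        return state
    return state + x * y

column = [10, None, 30]
count_star = run_aggregate(count_star_trans, 0, [() for _ in column])
count_col = run_aggregate(count_trans, 0, [(v,) for v in column])
sum_xy = run_aggregate(sum_xy_trans, 0, [(1, 2), (3, 4), (None, 5)])

print(count_star, count_col, sum_xy)  # 3 2 14
```

The driver has no special case for the zero-input aggregate: it simply passes an empty argument tuple for every row.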
regards, tom lane
On Sat, Apr 15, 2006 at 12:51:24AM -0400, Tom Lane wrote:
I wrote:
... Polya's Inventor's Paradox states that
"the more general problem may be easier to solve", and I've found that
usually holds up in program design too.
While fooling around with the grammar patch that I showed earlier today,
I had an epiphany that might serve as illustration of the above. We
have traditionally thought of COUNT(*) as an "aggregate over any base
type". But wouldn't it be cleaner to think of it as an aggregate over
zero inputs? That would get rid of the rather artificial need to
convert COUNT(*) to COUNT(1). We would actually have two separate
aggregate functions, which could most accurately be described as
count()
count(anyelement)
where the latter is the form that has the behavior of counting the
non-null values of the input.
While this doesn't really simplify nodeAgg.c, it wouldn't add any
complexity either (once the code has been recast to support variable
numbers of arguments). And it seems to me that it clarifies the
semantics noticeably --- in particular, there'd no longer be this weird
special case that an aggregate over ANY should have a one-input
transition function where everything else takes two-input. The rule
would be simple: an N-input aggregate uses an N-plus-one-input
transition function.
Speaking strictly from a user's PoV, I'm not sure this is a great idea,
since it encourages non-standard code (AFAIK no one else accepts
'count()'), and getting rid of support for count(*) seems like a
non-starter, so I'm not sure there's any benefit.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes:
On Sat, Apr 15, 2006 at 12:51:24AM -0400, Tom Lane wrote:
I had an epiphany that might serve as illustration of the above. We
have traditionally thought of COUNT(*) as an "aggregate over any base
type". But wouldn't it be cleaner to think of it as an aggregate over
zero inputs?
Speaking strictly from a user's PoV, I'm not sure this is a great idea,
since it encourages non-standard code (AFAIK no one else accepts
'count()'), and getting rid of support for count(*) seems like a
non-starter, so I'm not sure there's any benefit.
Well, if you want, we can still insist that actual invocations of a
zero-argument aggregate be spelled with (*). But from a conceptual and
documentation standpoint we should think of them as zero-argument,
not sort-of-one-argument.
regards, tom lane