requesting features in PostgreSQL
I would like to request a new aggregate function for
PostgreSQL.
I need an aggregate function that calculates geometric
mean, and can be used in SQL statements the same as
any other aggregate function such as count(), sum()
and avg(). Rather than have a custom function, I'd
like to see it added as a standard PostgreSQL feature.
Is there a specific PostgreSQL developer that I should
contact? Or is an open request on this list
sufficient for consideration?
The geometric mean function, I'll call it gmean(), is
similar to avg(), except that instead of adding the
individual values, you would multiply them; and
instead of dividing the sum by the sample size, you
would raise the resulting product by a power of (1 /
sample size).
Therefore, the gmean() of the values 1, 2, 3 and 9
would equal:
= (1 * 2 * 3 * 9) ^ (1 / 4)
= 54 ^ 0.25
= 2.710806 # rounded to 6 decimal places
This function differs from arithmetic mean (average)
in that it lessens the affect of outliers without
discounting them altogether.
Thanks,
Andrew Gould
__________________________________________________
Do You Yahoo!?
Yahoo! Games - play chess, backgammon, pool and more
http://games.yahoo.com/
Andrew Gould <andrewgould@yahoo.com> writes:
I need an aggregate function that calculates geometric
mean, and can be used in SQL statements the same as
any other aggregate function such as count(), sum()
and avg(). Rather than have a custom function, I'd
like to see it added as a standard PostgreSQL feature.
One request does not strike me as sufficient reason to make it a
standard feature. Extensibility is what Postgres is all about
--- so go ahead and write your own.
AFAICS this should take about ten minutes to prototype (two simple
plpgsql or pltcl functions and a user-defined aggregate). If you intend
to process very large volumes of data, it might be worth rewriting the
transition function in C for speed. (You could steal the existing
transition function for avg() as a model.)
regards, tom lane
One request is certainly not sufficient for a new
standard feature. As I see more industries becoming
data savvy, however, I think it's time will come.
Until then....
Simple python scripts are the extent of my coding
experience. As I will be processing 2 to 6 million
records within one query, I guess I should learn some
C.
Once I untar PostgreSQL's source, where can I find the
code for avg()? Will it have it's own file?
Thanks,
Andrew
--- Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew Gould <andrewgould@yahoo.com> writes:
I need an aggregate function that calculates
geometric
mean, and can be used in SQL statements the same
as
any other aggregate function such as count(),
sum()
and avg(). Rather than have a custom function,
I'd
like to see it added as a standard PostgreSQL
feature.
One request does not strike me as sufficient reason to make it a standard feature. Extensibility is what Postgres is all about --- so go ahead and write your own.AFAICS this should take about ten minutes to
prototype (two simple
plpgsql or pltcl functions and a user-defined
aggregate). If you intend
to process very large volumes of data, it might be
worth rewriting the
transition function in C for speed. (You could
steal the existing
transition function for avg() as a model.)regards, tom lane
---------------------------(end of
broadcast)---------------------------
TIP 2: you can get off all lists at once with the
unregister command
(send "unregister YourEmailAddressHere" to
majordomo@postgresql.org)
__________________________________________________
Do You Yahoo!?
Yahoo! Games - play chess, backgammon, pool and more
http://games.yahoo.com/
Andrew Gould <andrewgould@yahoo.com> writes:
Once I untar PostgreSQL's source, where can I find the
code for avg()? Will it have it's own file?
The C code is in src/backend/utils/adt/float.c --- look at
float8_accum() and float8_avg().
regards, tom lane
I decided to use your problem as a little exercize for myself, so I came up
with a working aggregate for gmean. I posted the code below. I wrote it in
plpgsql. It's possible the "a1" function (the main part of the aggregate)
won't be the bottleneck for performance. You could probably translate to C,
and you can probably solve the problem more gracefully than I, but I have
working code. I couldn't think of what else to use as a state type, so I just
used a two-element array of floats. The first element holds the running
product (i.e. 1*2*3*9) and the second holds the number of records visited (4).
I would appreciate it if you (or anyone else) would post (or direct email)
what changes you make for efficiency, including if you write a c function for
a1 (a2 doesn't really need a C func unless you really want, since it should
only be called once per aggregation). I could probably help you translate to
C if you'd like, but first I'd like to make sure I have the most efficient
algorithm.
Regards,
Jeff
______________________________________
create function a1f(float[2]) returns float as '
BEGIN
return ($1[1]^(1/($1[2])));
END;
' language 'plpgsql';
create function a1(float[2],float) returns float[2] as '
DECLARE
ret float[2];
BEGIN
ret := ''{'' || (($1[1]) * ($2)) || '','' || (($1[2]) + 1) || ''}'';
RETURN ret;
END;
' language 'plpgsql';
create aggregate a2
(basetype=float,sfunc=a1,stype=float[],finalfunc=a1f,initcond='{1.0,0.0}');
Jeff Davis <list-pgsql-general@empires.org> writes:
create function a1(float[2],float) returns float[2] as '
DECLARE
ret float[2];
BEGIN
ret := ''{'' || (($1[1]) * ($2)) || '','' || (($1[2]) + 1) || ''}'';
RETURN ret;
END;
' language 'plpgsql';
We really need better support for arrays in plpgsql :-(. The above will
work, but it invokes conversion of floats to text and back again on
every call; that's slow and will probably cause accumulation of roundoff
errors in the aggregate result.
I tried to do this:
create function a1(float[2],float) returns float[2] as '
declare
ret float[2];
begin
ret := $1;
ret[1] := ret[1] * $2;
ret[2] := ret[2] + 1;
return ret;
end' language plpgsql;
but it failed with syntax errors --- plpgsql doesn't understand the
notation "var[subscript] := something". Someone oughta dig into it
and fix that.
In the meantime I think the most practical way to do this task in
plpgsql is to abuse the built-in "point" type, which can be treated
as an array of 2 floats:
regression=# create function a1(point,float) returns point as '
regression'# begin
regression'# return point($1[0] * $2, $1[1] + 1);
regression'# end' language plpgsql;
CREATE
regression=# select a1('(2,3)'::point, 44);
a1
--------
(88,4)
(1 row)
Note that the subscripts are [0],[1] not [1],[2] ... a bit of legacy
incompatibility ...
regards, tom lane
Yup, I tried the same exact thing :) I wouldn't mind seeing some better
support in plpgsql either. However, anyone can create their own type so I
guess it isn't much of a problem.
I updated my code to use points instead, and more meaningful identifiers.
-----------------------------------------------------------
create function float_gmean(point) returns float as '
BEGIN
return ($1[0]^(1/($1[1])));
END;
' language 'plpgsql';
create function float_gmean_accum(point,float) returns point as '
BEGIN
RETURN point(($1[0]*$2),($1[1]+1));
END;
' language 'plpgsql';
create aggregate gmean
(basetype=float,sfunc=float_gmean_accum,stype=point,finalfunc=float_gmean,initcond='(1.0,0.0)');
-----------------------------------------------------------
Show quoted text
I tried to do this:
create function a1(float[2],float) returns float[2] as '
declare
ret float[2];
begin
ret := $1;
ret[1] := ret[1] * $2;
ret[2] := ret[2] + 1;
return ret;
end' language plpgsql;but it failed with syntax errors --- plpgsql doesn't understand the
notation "var[subscript] := something". Someone oughta dig into it
and fix that.In the meantime I think the most practical way to do this task in
plpgsql is to abuse the built-in "point" type, which can be treated
as an array of 2 floats:regression=# create function a1(point,float) returns point as '
regression'# begin
regression'# return point($1[0] * $2, $1[1] + 1);
regression'# end' language plpgsql;
CREATE
regression=# select a1('(2,3)'::point, 44);
a1
--------
(88,4)
(1 row)Note that the subscripts are [0],[1] not [1],[2] ... a bit of legacy
incompatibility ...regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
I was looking at the code. Why do you calculate "sumX2" in float8_accum?
Later you have a comment "ignore sumX2" in float8_avg().
Regards,
Jeff
Show quoted text
On Friday 26 April 2002 11:30 am, Tom Lane wrote:
Andrew Gould <andrewgould@yahoo.com> writes:
Once I untar PostgreSQL's source, where can I find the
code for avg()? Will it have it's own file?The C code is in src/backend/utils/adt/float.c --- look at
float8_accum() and float8_avg().regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
Jeff Davis <list-pgsql-general@empires.org> writes:
I was looking at the code. Why do you calculate "sumX2" in float8_accum?
Because the same accumulator function is also used for stddev.
regards, tom lane
I want to thank you both for your help on this. I
read the C source and the plpgsql below. I think I'm
understanding what I'm seeing.
I will be out of town for several days; but will try
out the plpsql solution next week. (If my luck in
airports holds true, I'll have plenty of time to read
the documentation.)
Thanks again,
Andrew
--- Jeff Davis <list-pgsql-general@empires.org> wrote:
Yup, I tried the same exact thing :) I wouldn't mind
seeing some better
support in plpgsql either. However, anyone can
create their own type so I
guess it isn't much of a problem.I updated my code to use points instead, and more
meaningful identifiers.
-----------------------------------------------------------
create function float_gmean(point) returns float as
'
BEGIN
return ($1[0]^(1/($1[1])));
END;
' language 'plpgsql';create function float_gmean_accum(point,float)
returns point as '
BEGIN
RETURN point(($1[0]*$2),($1[1]+1));
END;
' language 'plpgsql';create aggregate gmean
(basetype=float,sfunc=float_gmean_accum,stype=point,finalfunc=float_gmean,initcond='(1.0,0.0)');
-----------------------------------------------------------
I tried to do this:
create function a1(float[2],float) returns
float[2] as '
declare
ret float[2];
begin
ret := $1;
ret[1] := ret[1] * $2;
ret[2] := ret[2] + 1;
return ret;
end' language plpgsql;but it failed with syntax errors --- plpgsql
doesn't understand the
notation "var[subscript] := something". Someone
oughta dig into it
and fix that.
In the meantime I think the most practical way to
do this task in
plpgsql is to abuse the built-in "point" type,
which can be treated
as an array of 2 floats:
regression=# create function a1(point,float)
returns point as '
regression'# begin
regression'# return point($1[0] * $2, $1[1] + 1);
regression'# end' language plpgsql;
CREATE
regression=# select a1('(2,3)'::point, 44);
a1
--------
(88,4)
(1 row)Note that the subscripts are [0],[1] not [1],[2]
... a bit of legacy
incompatibility ...
regards, tom lane
---------------------------(end of
broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster
__________________________________________________
Do You Yahoo!?
Yahoo! Health - your guide to health and wellness
http://health.yahoo.com