Re: Implementing STDDEV and VARIANCE
Has this gone anywhere?
I'd like to implement stddev and variance aggregates in Postgres. This is a
long standing TODO item.There already has been some discussion on implementing this, see
http://www.postgresql.org/mhonarc/pgsql-hackers/1998-06/msg00175.htmlThere are two definitions for standard deviation and variance:
_
population variance = sigma^2 = SUM(X - X)^2 / Npopulation stddev = sqrt(population variance)
_
sample variance = s^2 = SUM(X - X)^2 / (N-1)sample stddev = sqrt(sample variance)
These statistics can be calculated in one pass when three variables are
kept when scanning through the rows, n, sum(x), sum(x^2). Currently, only
two variables are kept.E.g. avg() is calculated as follows:
sx = 0
n = 0
for every row {
sx = sx + value in row // transition function 1
n = n+1 // transition function 2
}
avg = sum(x) / nstddev / variance might be calculated as follows:
sx = 0
n = 0
sx2 = 0
for every row {
sx = sx + value in row // transition function 1
n = n+1 // transition function 2
sx2 = sx2 + value in row^2 // transition function 3
}
var = (1/n) * (sx2 - (1/n) * sx^2) // Populationor
var = (1/(n-1)) * (sx2 - (1/n) * sx^2) // Sample
and
stddev = sqrt(var)
I've looked through the code and the following things need to be implemented:
1. Add three columns to pg_aggregate for the additional third transition
function.Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:
All that you need to implement this is room to keep two running
sums instead of one. I haven't looked at pgsql's aggregate functions,
but I'd hope that the working state can be a struct not just a
single number.I saw no other way than adding another transition function and logic, as
this might break user-defined aggregates (are there any around?).2. Add logic to nodeAgg.c to execute the third transition function and
finalize function with three rather than two parameters
3. Add functions f(a,b) that returns a + b^2 for selected types
4. Add four finalize functions to calculate the variance / stddev
5. Update the code for create aggregate, to include the definition of the
third transition function
6. Update the documentationMy questions are:
1. Is this the correct way to continue? What am I missing? Any errors in my
reasoning?
2. I am proposing the names stddev(x) and variance(x) for population and
samplestddev(x) and
samplevariance(x) for sample statistics. Any comments?
3. I'm planning to implement this for types float4, float8 and numeric. Any
other types also? int[2,4,8] don't seem logical, as these would introduce
serious rounding errors.Let me know what you think,
Jeroen
************
--
Bruce Momjian | http://www.op.net/~candle
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Import Notes
Reply to msg id not found: 4.2.0.58.20000123170505.00954560@mail.design.nl
Bruce Momjian wrote:
I created all that sometimes back. Dunno why never added it
to contrib. Will post it another day.
Jan
Has this gone anywhere?
I'd like to implement stddev and variance aggregates in Postgres. This is a
long standing TODO item.There already has been some discussion on implementing this, see
http://www.postgresql.org/mhonarc/pgsql-hackers/1998-06/msg00175.htmlThere are two definitions for standard deviation and variance:
_
population variance = sigma^2 = SUM(X - X)^2 / Npopulation stddev = sqrt(population variance)
_
sample variance = s^2 = SUM(X - X)^2 / (N-1)sample stddev = sqrt(sample variance)
These statistics can be calculated in one pass when three variables are
kept when scanning through the rows, n, sum(x), sum(x^2). Currently, only
two variables are kept.E.g. avg() is calculated as follows:
sx = 0
n = 0
for every row {
sx = sx + value in row // transition function 1
n = n+1 // transition function 2
}
avg = sum(x) / nstddev / variance might be calculated as follows:
sx = 0
n = 0
sx2 = 0
for every row {
sx = sx + value in row // transition function 1
n = n+1 // transition function 2
sx2 = sx2 + value in row^2 // transition function 3
}
var = (1/n) * (sx2 - (1/n) * sx^2) // Populationor
var = (1/(n-1)) * (sx2 - (1/n) * sx^2) // Sample
and
stddev = sqrt(var)
I've looked through the code and the following things need to be implemented:
1. Add three columns to pg_aggregate for the additional third transition
function.Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:
All that you need to implement this is room to keep two running
sums instead of one. I haven't looked at pgsql's aggregate functions,
but I'd hope that the working state can be a struct not just a
single number.I saw no other way than adding another transition function and logic, as
this might break user-defined aggregates (are there any around?).2. Add logic to nodeAgg.c to execute the third transition function and
finalize function with three rather than two parameters
3. Add functions f(a,b) that returns a + b^2 for selected types
4. Add four finalize functions to calculate the variance / stddev
5. Update the code for create aggregate, to include the definition of the
third transition function
6. Update the documentationMy questions are:
1. Is this the correct way to continue? What am I missing? Any errors in my
reasoning?
2. I am proposing the names stddev(x) and variance(x) for population and
samplestddev(x) and
samplevariance(x) for sample statistics. Any comments?
3. I'm planning to implement this for types float4, float8 and numeric. Any
other types also? int[2,4,8] don't seem logical, as these would introduce
serious rounding errors.Let me know what you think,
Jeroen
************
-- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #