Standard Deviation function.

Started by Chris Albertsonover 27 years ago10 messages

chris@topdog.pas1.logicon.com

over 27 years ago

Hello,

I really need a Standard Deviation aggregate function so I will
try to write one.

I know about the man pages for "create aggregate" and "create
function". Is there something else I should look at?
Just a few pointers could save me a few hours of hunting around.
All advice accepted.

It seems kind of hard to do with only two state functions unless
I "cheat". I need to keep three values, Count, Sum, and Sum of
Squares. I could use three static variables and have the final
function ignore its input and use the static vars instead. This
will likely blow up if the new Standard Deviation aggregate is
used twice in the same select.

Any hints or advice??

If someone has this done already let me know.

I may want do a "median" aggregate function too as I'll need that
later. This would require private storage and a sort.

Could you cc me at both addresses below as I move around between
them

Thanks,

--
--Chris Albertson

chrisja@jps.net
chris@topdog.logicon.com Voice: 626-351-0089 X127
Fax: 626-351-0699

Stupor Genius

stuporg@erols.com

over 27 years ago

In reply to: Chris Albertson (#1)

RE: [HACKERS] Standard Deviation function.

I really need a Standard Deviation aggregate function...

I know about the man pages for "create aggregate" and "create
function". Is there something else I should look at?

It seems kind of hard to do with only two state functions unless
I "cheat". I need to keep three values, Count, Sum, and Sum of
Squares.

Any hints or advice??

I thought about this a long time ago and had an idea but never
got around to trying to implement it. I was going to have some
functions that worked on a structure of two doubles to track
the sum and square instead of using only one simple type.

darrenk

Bruce Momjian

maillist@candle.pha.pa.us

over 27 years ago

In reply to: Stupor Genius (#2)

Re: [HACKERS] Standard Deviation function.

I really need a Standard Deviation aggregate function...

I know about the man pages for "create aggregate" and "create
function". Is there something else I should look at?

It seems kind of hard to do with only two state functions unless
I "cheat". I need to keep three values, Count, Sum, and Sum of
Squares.

Any hints or advice??

I thought about this a long time ago and had an idea but never
got around to trying to implement it. I was going to have some
functions that worked on a structure of two doubles to track
the sum and square instead of using only one simple type.

I remember talking about this to someone, and the problem is that you
needed the average WHILE scanning through the table, which required two
passes, which the aggregate system is not designed to do. I may be
wrong on this, though.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Stupor Genius

stuporg@erols.com

over 27 years ago

In reply to: Bruce Momjian (#3)

RE: [HACKERS] Standard Deviation function.

I really need a Standard Deviation aggregate function...

I thought about this a long time ago and had an idea but never
got around to trying to implement it. I was going to have some
functions that worked on a structure of two doubles to track
the sum and square instead of using only one simple type.

I remember talking about this to someone, and the problem is that you
needed the average WHILE scanning through the table, which required two
passes, which the aggregate system is not designed to do. I may be
wrong on this, though.

I had asked you how to calculate this and the variance early last
year. One (I think the variance) was two-pass because of the need
for the average, but I thought the StdDev would work with the struct.

Been a while and I still haven't configured #(*&^ FreeBSD ppp yet.

darrenk

Bruce Momjian

maillist@candle.pha.pa.us

over 27 years ago

In reply to: Stupor Genius (#4)

Re: [HACKERS] Standard Deviation function.

I had asked you how to calculate this and the variance early last
year. One (I think the variance) was two-pass because of the need
for the average, but I thought the StdDev would work with the struct.

Variance is just square of std. dev, no?

Been a while and I still haven't configured #(*&^ FreeBSD ppp yet.

Bummer.

-- 
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Andreas Zeugswetter

andreas.zeugswetter@telecom.at

over 27 years ago

In reply to: Bruce Momjian (#5)

AW: [HACKERS] Standard Deviation function.

I had asked you how to calculate this and the variance early last
year. One (I think the variance) was two-pass because of the need
for the average, but I thought the StdDev would work with the struct.

Variance is just square of std. dev, no?

No ! Stdev is divided by count, Variance by (count - 1)

It was some time ago, but I thing there is a running function that
can be calculated with one pass. I might be able to dig it up somewhere.
I had it in an excel sheet for learning purposes.

Andreas

Import Notes

Resolved by subject fallback

Noname

dg@illustra.com

over 27 years ago

In reply to: Stupor Genius (#4)

Re: [HACKERS] Standard Deviation function.

I really need a Standard Deviation aggregate function...

I thought about this a long time ago and had an idea but never
got around to trying to implement it. I was going to have some
functions that worked on a structure of two doubles to track
the sum and square instead of using only one simple type.

I remember talking about this to someone, and the problem is that you
needed the average WHILE scanning through the table, which required two
passes, which the aggregate system is not designed to do. I may be
wrong on this, though.

I had asked you how to calculate this and the variance early last
year. One (I think the variance) was two-pass because of the need
for the average, but I thought the StdDev would work with the struct.

Been a while and I still haven't configured #(*&^ FreeBSD ppp yet.

The Perl Module "Statistics/Descriptive" has on the fly variance calculation.

sub add_data {
my $self = shift; ##Myself
my $oldmean;
my ($min,$mindex,$max,$maxdex);

##Take care of appending to an existing data set
$min = (defined ($self->{min}) ? $self->{min} : $_[0]);
$max = (defined ($self->{max}) ? $self->{max} : $_[0]);
$maxdex = $self->{maxdex} || 0;
$mindex = $self->{mindex} || 0;

##Calculate new mean, pseudo-variance, min and max;
foreach (@_) {
$oldmean = $self->{mean};
$self->{sum} += $_;
$self->{count}++;
if ($_ >= $max) {
$max = $_;
$maxdex = $self->{count}-1;
}
if ($_ <= $min) {
$min = $_;
$mindex = $self->{count}-1;
}
$self->{mean} += ($_ - $oldmean) / $self->{count};
$self->{pseudo_variance} += ($_ - $oldmean) * ($_ - $self->{mean});
}

$self->{min} = $min;
$self->{mindex} = $mindex;
$self->{max} = $max;
$self->{maxdex} = $maxdex;
$self->{sample_range} = $self->{max} - $self->{min};
if ($self->{count} > 1) {
$self->{variance} = $self->{pseudo_variance} / ($self->{count} -1);
$self->{standard_deviation} = sqrt( $self->{variance});
}
return 1;
}

-dg

David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
"Don't worry about people stealing your ideas. If your ideas are any
good, you'll have to ram them down people's throats." -- Howard Aiken

Andreas Zeugswetter

andreas.zeugswetter@telecom.at

over 27 years ago

In reply to: Noname (#7)

Re: [HACKERS] Standard Deviation function.

David Gould wrote:

The Perl Module "Statistics/Descriptive" has on the fly variance calculation.

sub add_data {
my $self = shift; ##Myself
my $oldmean;
my ($min,$mindex,$max,$maxdex);

##Take care of appending to an existing data set
$min = (defined ($self->{min}) ? $self->{min} : $_[0]);
$max = (defined ($self->{max}) ? $self->{max} : $_[0]);
$maxdex = $self->{maxdex} || 0;
$mindex = $self->{mindex} || 0;

##Calculate new mean, pseudo-variance, min and max;
foreach (@_) {
$oldmean = $self->{mean};
$self->{sum} += $_;
$self->{count}++;
if ($_ >= $max) {
$max = $_;
$maxdex = $self->{count}-1;
}
if ($_ <= $min) {
$min = $_;
$mindex = $self->{count}-1;
}
$self->{mean} += ($_ - $oldmean) / $self->{count};
$self->{pseudo_variance} += ($_ - $oldmean) * ($_ - $self->{mean});
}

$self->{min} = $min;
$self->{mindex} = $mindex;
$self->{max} = $max;
$self->{maxdex} = $maxdex;
$self->{sample_range} = $self->{max} - $self->{min};
if ($self->{count} > 1) {
$self->{variance} = $self->{pseudo_variance} / ($self->{count} -1);
$self->{standard_deviation} = sqrt( $self->{variance});

Wow, this is it. But as I said, the above line is wrong (By the way: this is a very common mistake).
It should read:
$self->{standard_deviation} = sqrt( $self->{pseudo_variance} / $self->{count} )
Note: The - 1 is missing

Show quoted text

}
return 1;
}

Import Notes

Resolved by subject fallback

Tom Lane

tgl@sss.pgh.pa.us

over 27 years ago

In reply to: Andreas Zeugswetter (#8)

Re: [HACKERS] Standard Deviation function.

Andreas Zeugswetter <andreas.zeugswetter@telecom.at> writes:

Wow, this is it. But as I said, the above line is wrong (By the way:
this is a very common mistake).
It should read:
$self->{standard_deviation} = sqrt( $self->{pseudo_variance} / $self->{count} )
Note: The - 1 is missing

The formula with N-1 in the divisor is correct for the "sample standard
deviation". That is what you use when your N data points represent a
sample from a larger population, and you want to estimate the standard
deviation of the whole population.

If your N data points in fact are the *whole* population of interest,
then you calculate the "population standard deviation" which has just N
in the divisor. So both versions of the formula are correct depending
on the situation, and you really ought to provide both.

(To justify the difference intuitively: if you have exactly one data
point, and it is the *whole* population, then the mean equals the
data value and the standard deviation is zero. That is what you get
with N in the divisor. But if your one data point is a sample from
a larger population, you cannot estimate the population's standard
deviation; you need more data. The N-1 equation gives 0/0 in this
case, correctly signifying that the value is indeterminate.)

I think the Perl code given earlier in the thread pretty much sucks
from a numerical accuracy point of view. The running mean calculation
suffers from accumulation of errors, and that propagates into the
pseudo-variance in a big way. It's particularly bad if the data is
tightly clustered about the mean; the code ends up doing lots of
subtractions of nearly equal values.

The accepted way to do sample standard deviation in one pass is this:

STDDEV = SQRT( (N*SIGMA(Xi^2) - SIGMA(Xi)^2) / (N*(N-1)) )

where N is the number of data points and SIGMA(Xi) means the sum
of the data values Xi. You keep running sums of Xi and Xi^2 as
you pass over the data, then you apply the above equation once
at the end. (For population standard deviation, you use N^2 as
the denominator. For variance, you just leave off the SQRT().)

All that you need to implement this is room to keep two running
sums instead of one. I haven't looked at pgsql's aggregate functions,
but I'd hope that the working state can be a struct not just a
single number.

regards, tom lane

Import Notes

Reply to msg id not found: YourmessageofFri5Jun1998104324+020001BD906E.FAFE3D10@zeugswettera.user.lan.at | Resolved by subject fallback

#10

Noname

dj@pelf.harvard.edu

over 27 years ago

In reply to: Tom Lane (#9)

Re: [HACKERS] Standard Deviation function.

For on-the-fly standard deviation calculations with an eye towards
numerical accuracy, check out:

Chan, T.F, and Lewis, J.G., "Computing Standard Deviations: Accuracy",
Communications of the ACM, Vol 22, No. 9, September 1979, p. 526

and

West, D.H.D, "Updating Mean and Variance Estimates: An Improved Method",
Communications of the ACM, Vol 22, No. 9, September 1979, p. 532

The articles were writen when single precision floating point
was the limiting factor, but the principles are just as relevant.

Diab

-------------
Diab Jerius Harvard-Smithsonian Center for Astrophysics
60 Garden St, MS 70, Cambridge MA 02138 USA
djerius@cfa.harvard.edu vox: 617 496 7575 fax: 617 495 7356

Import Notes

Resolved by subject fallback