finding medians

Started by Josh Burdickover 23 years ago6 messages
#1Josh Burdick
jburdick@gradient.cis.upenn.edu

At the end of this message is some code I used to find medians.
It's kind of a hack, but approximately works, and is intended as a
somewhat awkward stopgap for people who need to use medians. It
illustrates the limitations of the current aggregate function setup,
which works so nicely for avg() and stddev().
I don't have any good solutions. I tried using a float4[] to store
each element as it's added, but I couldn't get array updates working in
PL/PgSQL, so that didn't help.
Perhaps aggregate functions could be passed an array? Or a cursor,
pointing at the first line? I'm not sure.

Anyways, perhaps it'll be helpful.
Josh

--
Josh Burdick
jburdick@gradient.cis.upenn.edu
http://www.cis.upenn.edu/~jburdick

/* Implementing median-finding in "pure Postgres." Does this by
copying data to a temporary table.

A weakness of this code is that it uses sorting, instead of
Hoare's linear-time median algorithm. Presumably sorting is
implemented so efficiently that it'll be faster than anything
written in PL/PgSQL. (Although Hoare's algorithm implemented
in C would be faster than either.)

BUG: this isn't properly set up to deal with multiple users.
For example, if A computes a median, then B could read the data
from the median_tmp table. Possibly you could fiddle with
transaction isolation levels, or add a user field to median_tmp,
or something else complicated, to prevent this, but for now I'm
not worrying about this.

Written by Josh Burdick (jburdick@gradient.cis.upenn.edu).
Anyone can use this under the same license as Postgres.

20020524, jtb: started. */

drop aggregate median(float4);
drop table median_tmp;
drop sequence median_id;
drop index median_tmp_median_id;
drop function median_sfunc_float4(bigint, float4);
drop function median_finalfunc_float4(bigint);

create sequence median_id;
create table median_tmp (
median_id int,
x float4
);
create index median_tmp_median_id on median_tmp(median_id);

create function median_sfunc_float4
(bigint, float4) returns bigint as '

insert into median_tmp
values (case when $1 = 0 then nextval(''median_id'') else $1 end, $2);

select currval(''median_id'');

' language 'SQL';

create function median_finalfunc_float4
(bigint) returns float4 as '
declare

i bigint;
n bigint;
c refcursor;
m float4;
m1 float4;

begin

n := (select count(*) from median_tmp where median_id = $1);

open c for select x from median_tmp where median_id = $1 order by x;

for i in 1..((n+1)/2) loop
fetch c into m;
end loop;

/* if n is even, fetch the next value, and average the two */
if (n % int8(2) = int8(0)) then
fetch c into m1;
m := (m + m1) / 2;
end if;

delete from median_tmp where median_id = $1;

return m;

end
' language 'plpgsql';

create aggregate median (
basetype = float4,
stype = bigint,
initcond = 0,
sfunc = median_sfunc_float4,
finalfunc = median_finalfunc_float4
);

#2Josh Berkus
josh@agliodbs.com
In reply to: Josh Burdick (#1)
Re: finding medians

Josh,

At the end of this message is some code I used to find medians.
It's kind of a hack, but approximately works, and is intended as a
somewhat awkward stopgap for people who need to use medians. It
illustrates the limitations of the current aggregate function setup,
which works so nicely for avg() and stddev().

Actually, finding the median is one of the classic SQL problems. You can't do
it without 2 passes through the data set, disallowing the use of traditional
aggregate functions. Joe Celko has half a chapter devoted to various methods
of finding the median.

Can I talk you into submitting your code to Techdocs? I'd love to have it
somewhere where it won't get buried in the mailing list archives.

--
-Josh Berkus

#3Dann Corbit
DCorbit@connx.com
In reply to: Josh Berkus (#2)
Re: finding medians

ACK! Sorting to find a median is criminal.

"Introduction to Algorithms" by Thomas H. Cormen, Charles E. Leiserson,
Ronald L. Rivest
ISBN: 0262031418
explains the better algorithm very well.

Here is a freely available C++ template (written by me) for a bunch of
statistics (everything *but* the selection problem):
ftp://cap.connx.com/pub/tournament_software/STATS.HPP

It uses this template for improved summation accuracy:
ftp://cap.connx.com/pub/tournament_software/Kahan.Hpp

Here is an outline for selection. I wrote it in C++, but a rewrite to C
is trivial:

// Quickselect: find Kth smallest of first N items in array A
// recursive routine finds Kth smallest in A[Low..High]
// Etype: must have copy constructor, oeprator=, and operator<
// Nonrecursive driver is omitted.

template < class Etype >
void
QuickSelect (Etype A[], int Low, int High, int k)
{
if (Low + Cutoff > High)
InsertionSort (&A[Low], High - Low + 1);
else
{
// Sort Low, Middle, High
int Middle = (Low + High) / 2;

if (A[Middle] < A[Low])
Swap (A[Low], A[Middle]);
if (A[High] < A[Low])
Swap (A[Low], A[High]);
if (A[High] < A[Middle])
Swap (A[Middle], A[High]);

// Place pivot at Position High-1
Etype Pivot = A[Middle];
Swap (A[Middle], A[High - 1]);

// Begin partitioning
int i, j;
for (i = Low, j = High - 1;;)
{
while (A[++i] < Pivot);
while (Pivot < A[--j]);
if (i < j)
Swap (A[i], A[j]);
else
break;
}

// Restore pivot
Swap (A[i], A[High - 1]);

// Recurse: only this part changes
if (k < i)
QuickSelect (A, Low, i - 1, k);
else if (k > i)
QuickSelect (A, i + 1, High, k);
}
}

template < class Etype >
void
QuickSelect (Etype A[], int N, int k)
{
QuickSelect (A, 0, N - 1, k - 1);
}

If you want to use this stuff to improve statistics with vacuum, be my
guest.

#4Dann Corbit
DCorbit@connx.com
In reply to: Dann Corbit (#3)
Re: finding medians

Here is a program written in C that demonstrates 2 median/selection
computation techniques:
ACM Algorithm 727 (implementation by Sherif Hashem)
QuickSelect (implemented by me).

Since it is written in C, it would be useful to PostgreSQL project
without any fanfare.
ftp://cap.connx.com/pub/chess-engines/new-approach/727.c

The ACM agorithm 727 is an approximation.
The QuickSelect result is exact.

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Burdick (#1)
Re: finding medians

Josh Burdick <jburdick@gradient.cis.upenn.edu> writes:

illustrates the limitations of the current aggregate function setup,
which works so nicely for avg() and stddev().
I don't have any good solutions. I tried using a float4[] to store
each element as it's added, but I couldn't get array updates working in
PL/PgSQL, so that didn't help.
Perhaps aggregate functions could be passed an array? Or a cursor,
pointing at the first line? I'm not sure.

I don't think that would help. The real problem here is the amount of
internal storage needed. AFAIK there are no exact algorithms for finding
the median that require less than O(N) workspace for N input items.
Your "hack" with a temporary table is not a bad approach if you want to
work for large N.

There are algorithms out there for finding approximate medians using
limited workspace; it might be reasonable to transform one of these into
a Postgres aggregate function.

regards, tom lane

#6Dann Corbit
DCorbit@connx.com
In reply to: Tom Lane (#5)
Re: finding medians

-----Original Message-----
From: Hannu Krosing [mailto:hannu@tm.ee]
Sent: Thursday, May 30, 2002 1:17 PM
To: Josh Burdick
Cc: pgsql-hackers@postgresql.org; josh@agliodbs.com
Subject: Re: [HACKERS] finding medians

On Fri, 2002-05-31 at 01:16, Josh Burdick wrote:

BUG: this isn't properly set up to deal with multiple users.
For example, if A computes a median, then B could read the data
from the median_tmp table. Possibly you could fiddle with
transaction isolation levels, or add a user field to median_tmp,
or something else complicated, to prevent this, but for now I'm
not worrying about this.

You could just use temp tables and indexes - they are local to
connection .

create TEMP sequence median_id;
create TEMP table median_tmp (
median_id int,
x float4
);

Another pure SQL solution would be to perform an order by on the column
of interest.

Use a cursor to seek to the middle element. If the data set is odd,
then the median is the center element. If the data set is even, the
median is the average of the two center elements.

A SQL function would be pretty easy. Of course, it is not the most
efficient way to do it. A nice thing about having the data ordered is
that you can then extract the statistic for any kth partition. Hence,
you could generate a function quantile() and call quantile (0.5) to get
the median. If you have a query:
select quantile(.1, col), quantile(.2, col), quantile(.3,col), ...
quantile(.9, col) from sometable
you only have to do the sort once and then operate on the sorted data.
For queries like that, sorting is probably just fine, since the
selection algorithm is only approximately linear for each single
instance.