aggregate function for median calculation

Started by Thalis A. Kalfigopoulosalmost 25 years ago12 messagesgeneral
Jump to latest
#1Thalis A. Kalfigopoulos
thalis@cs.pitt.edu

Hippl,
I'm interested in calculating the median of a set of numbers. The algorithm requires that all values are known in advance (ie stored in an array). So the question is: how can I store everything first in an array so I can later process it given that I'd like this to be an aggregate function. I thought of creating an aggregate function and have the state_function() gather all the values of a group in an array and the final_function() to do the actuall median calculation on this array. But the intermmediate state cannot hold multiple values in an array (can it?)
Any ideas on how to go with this?

TIA,
thalis

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thalis A. Kalfigopoulos (#1)
Re: aggregate function for median calculation

"Thalis A. Kalfigopoulos" <thalis@cs.pitt.edu> writes:

But the intermmediate state cannot hold multiple values in an array
(can it?)

Sure, why not? Might not scale too well to lots of values, however.

regards, tom lane

#3Alex Pilosov
alex@pilosoft.com
In reply to: Thalis A. Kalfigopoulos (#1)
Re: aggregate function for median calculation

On Mon, 18 Jun 2001, Thalis A. Kalfigopoulos wrote:

Hippl,
I'm interested in calculating the median of a set of numbers.
The algorithm requires that all values are known in advance (ie stored
in an array). So the question is: how can I store everything first in
an array so I can later process it given that I'd like this to be an
aggregate function. I thought of creating an aggregate function and
have the state_function() gather all the values of a group in an array
and the final_function() to do the actuall median calculation on this
array. But the intermmediate state cannot hold multiple values in an
array (can it?) Any ideas on how to go with this?

With current architecture, its kinda painful to implement such a function.
Your 'state' function should allocate (palloc) memory for each element
processed and then pfree it when you are done.

-alex

#4Philip Hallstrom
philip@adhesivemedia.com
In reply to: Alex Pilosov (#3)
Re: aggregate function for median calculation

I missed the first part, but if the numbers are rows in a table, why not
do something like:

numrows = select count(*) from table1 where some_condition
median_value = select some_col from table1 where some_condition order by
some_col limit numrows/2, 1

(or something very close to that anyway).

-philip

On Mon, 18 Jun 2001, Alex Pilosov wrote:

Show quoted text

On Mon, 18 Jun 2001, Thalis A. Kalfigopoulos wrote:

Hippl,
I'm interested in calculating the median of a set of numbers.
The algorithm requires that all values are known in advance (ie stored
in an array). So the question is: how can I store everything first in
an array so I can later process it given that I'd like this to be an
aggregate function. I thought of creating an aggregate function and
have the state_function() gather all the values of a group in an array
and the final_function() to do the actuall median calculation on this
array. But the intermmediate state cannot hold multiple values in an
array (can it?) Any ideas on how to go with this?

With current architecture, its kinda painful to implement such a function.
Your 'state' function should allocate (palloc) memory for each element
processed and then pfree it when you are done.

-alex

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://www.postgresql.org/search.mpl

#5Thalis A. Kalfigopoulos
thalis@cs.pitt.edu
In reply to: Philip Hallstrom (#4)
Re: aggregate function for median calculation

On Mon, 18 Jun 2001, Philip Hallstrom wrote:

I missed the first part, but if the numbers are rows in a table, why not
do something like:

numrows = select count(*) from table1 where some_condition
median_value = select some_col from table1 where some_condition order by
some_col limit numrows/2, 1

(or something very close to that anyway).

-philip

Because I'll be using this median() as an aggregate function and grouping by multiple attributes of different tables. So instead of me doing the aggregation bookkeeping, I'd rather have it as a Pg aggr. function.

cheers,
thalis

Show quoted text

On Mon, 18 Jun 2001, Alex Pilosov wrote:

On Mon, 18 Jun 2001, Thalis A. Kalfigopoulos wrote:

Hippl,
I'm interested in calculating the median of a set of numbers.
The algorithm requires that all values are known in advance (ie stored
in an array). So the question is: how can I store everything first in
an array so I can later process it given that I'd like this to be an
aggregate function. I thought of creating an aggregate function and
have the state_function() gather all the values of a group in an array
and the final_function() to do the actuall median calculation on this
array. But the intermmediate state cannot hold multiple values in an
array (can it?) Any ideas on how to go with this?

With current architecture, its kinda painful to implement such a function.
Your 'state' function should allocate (palloc) memory for each element
processed and then pfree it when you are done.

-alex

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://www.postgresql.org/search.mpl

#6Thalis A. Kalfigopoulos
thalis@cs.pitt.edu
In reply to: Alex Pilosov (#3)
Re: aggregate function for median calculation

On Mon, 18 Jun 2001, Alex Pilosov wrote:

On Mon, 18 Jun 2001, Thalis A. Kalfigopoulos wrote:

Hippl,
I'm interested in calculating the median of a set of numbers.
The algorithm requires that all values are known in advance (ie stored
in an array). So the question is: how can I store everything first in
an array so I can later process it given that I'd like this to be an
aggregate function. I thought of creating an aggregate function and
have the state_function() gather all the values of a group in an array
and the final_function() to do the actuall median calculation on this
array. But the intermmediate state cannot hold multiple values in an
array (can it?) Any ideas on how to go with this?

With current architecture, its kinda painful to implement such a function.
Your 'state' function should allocate (palloc) memory for each element
processed and then pfree it when you are done.

-alex

So is there a way I can actually have a variable (the array with the elements) maintained between calls of the same function? I'm trying to figure how to design the aggregate's state_function that will do the element accumulation (before actually passing this array to the final to calculate the median).

TIA,
thalis

#7Peter Eisentraut
peter_e@gmx.net
In reply to: Thalis A. Kalfigopoulos (#6)
Re: aggregate function for median calculation

Thalis A. Kalfigopoulos writes:

So is there a way I can actually have a variable (the array with the elements) maintained between calls of the same function? I'm trying to figure how to design the aggregate's state_function that will do the element accumulation (before actually passing this array to the final to calculate the median).

Sure, you create a (static) global variable and reallocate memory for it
in each call and free it by the finalizer function.

--
Peter Eisentraut peter_e@gmx.net http://funkturm.homeip.net/~peter

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#7)
Re: aggregate function for median calculation

Peter Eisentraut <peter_e@gmx.net> writes:

Sure, you create a (static) global variable and reallocate memory for it
in each call and free it by the finalizer function.

A static would be a bad idea (consider a query with multiple instances
of this aggregate being evaluated in parallel).

But there's no reason that the transition function can't palloc a larger
and larger chunk of memory for each result (as long as you don't run out
of memory, anyway).

regards, tom lane

#9Thalis A. Kalfigopoulos
thalis@cs.pitt.edu
In reply to: Tom Lane (#8)
Re: aggregate function for median calculation

On Tue, 19 Jun 2001, Tom Lane wrote:

Peter Eisentraut <peter_e@gmx.net> writes:

Sure, you create a (static) global variable and reallocate memory for it
in each call and free it by the finalizer function.

A static would be a bad idea (consider a query with multiple instances
of this aggregate being evaluated in parallel).

But there's no reason that the transition function can't palloc a larger
and larger chunk of memory for each result (as long as you don't run out
of memory, anyway).

regards, tom lane

I'm still a bit confused about how to declare the type of the transition state.
I have a structure that will hold the current state:
struct state {
int elem_count; //how many numbers have been assigned so far
int *elem_area; //ptr to area holding these integers whose median I'll later calculate
}

So the transition function is:
struct state *trans_function(struct state *last_state,int new_element){
//Allocate mem for a new state structure to hold all previous numbers + new one
//Copy over all the previous elements from last_state and add the new_element
//Return ptr to this new state struct
}

My question is: how do I declare this in Pg?
declare function transition_func(???,int4) RETURNS ??? AS 'path_to_.so_file','trans_function' LANGUAGE 'C';

I assume that I have mixed up what can be a transition state :-/

Any help welcome.

TIA,
thalis

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thalis A. Kalfigopoulos (#9)
Re: aggregate function for median calculation

"Thalis A. Kalfigopoulos" <thalis@cs.pitt.edu> writes:

I'm still a bit confused about how to declare the type of the transition state.
I have a structure that will hold the current state:
struct state {
int elem_count; //how many numbers have been assigned so far
int *elem_area; //ptr to area holding these integers whose median I'll later calculate
}

My question is: how do I declare this in Pg?

Since you haven't made this a full-fledged Postgres type, you don't.

There's no real *need* to make it a full-fledged type, actually, since
nobody except your two aggregate functions will ever manipulate the
value. So I'd suggest cheating. Make it a legal "varlena" value by
having the first word of the struct contain the total length in bytes
of the object (including itself). Then pretend in the SQL transition
function declaration and aggregate declaration that the transition
datatype is any old varlena type --- text or bytea would do fine.
Nothing except your code will look at anything except the varlena
length, so the fact that the body of the value means something unusual
won't matter.

Or, if you want to be rigidly correct, you could make the transition
value be a legitimate int4 array (_int4), which is only a field or
three more overhead. But I don't see any value in it, except possibly
for manual testing of your transition functions.

regards, tom lane

#11Thalis A. Kalfigopoulos
thalis@cs.pitt.edu
In reply to: Tom Lane (#10)
Re: aggregate function for median calculation

On Thu, 21 Jun 2001, Tom Lane wrote:

"Thalis A. Kalfigopoulos" <thalis@cs.pitt.edu> writes:

I'm still a bit confused about how to declare the type of the transition state.
I have a structure that will hold the current state:
struct state {
int elem_count; //how many numbers have been assigned so far
int *elem_area; //ptr to area holding these integers whose median I'll later calculate
}

My question is: how do I declare this in Pg?

Since you haven't made this a full-fledged Postgres type, you don't.

There's no real *need* to make it a full-fledged type, actually, since
nobody except your two aggregate functions will ever manipulate the
value. So I'd suggest cheating. Make it a legal "varlena" value by
having the first word of the struct contain the total length in bytes
of the object (including itself). Then pretend in the SQL transition
function declaration and aggregate declaration that the transition
datatype is any old varlena type --- text or bytea would do fine.
Nothing except your code will look at anything except the varlena
length, so the fact that the body of the value means something unusual
won't matter.

Or, if you want to be rigidly correct, you could make the transition
value be a legitimate int4 array (_int4), which is only a field or
three more overhead. But I don't see any value in it, except possibly
for manual testing of your transition functions.

regards, tom lane

I worked on it so the struct state now has 3 fields: int length, int elem_count, int *elements. Now the issue is that the transition function, which I declare as strict, has a different input and state type (int4 and text respectively), so it requires an INITCOND. What should that be? It has to be the text representation of what the state struct is initially. The state initially has elem_count=0, elements=NULL and length=sizeof(struct state) which is 3*sizeof(int) (the 2 integer fields and a pointer i.e. another integer). How do I explain that?

I tried some scenarios for INITCOD like '\0' and 0 but they all crashed :-/

Any ideas?

tia,
thalis

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thalis A. Kalfigopoulos (#11)
Re: aggregate function for median calculation

"Thalis A. Kalfigopoulos" <thalis@cs.pitt.edu> writes:

I worked on it so the struct state now has 3 fields: int length, int
elem_count, int *elements. Now the issue is that the transition
function, which I declare as strict, has a different input and state
type (int4 and text respectively), so it requires an INITCOND. What
should that be?

You're going to have to make your code able to start from a value that
can be represented as an INITCOND. Say, an empty string (zero length
word and nothing else). You're cheating, remember?

regards, tom lane