wip: functions median and percentile
Hello
I am sending a prototype implementation of functions median and
percentile. This implementation is very simple and I moved it to
contrib for this moment - it is more easy maintainable. Later I'll
move it to core.
These functions are relative simple, there are not barrier for
implementation own specific mutations of this functions - so I propose
move to core only basic and well known form of these to core.
postgres=# select median(v) from generate_series(1,10) g(v);
median
────────
5.5
(1 row)
Time: 1.475 ms
postgres=# select percentile(v,50) from generate_series(1,10) g(v);
percentile
────────────
5
(1 row)
Time: 0.626 ms
This implementation is based on tuplesort and the speed is relative
well - the result from 1000000 rows is less 1 sec.
Regards
Pavel Stehule
Attachments:
median.difftext/x-patch; charset=US-ASCII; name=median.diffDownload+298-0
On Thu, Aug 19, 2010 at 11:59 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
I am sending a prototype implementation of functions median and
percentile. This implementation is very simple and I moved it to
contrib for this moment - it is more easy maintainable. Later I'll
move it to core.
So if the entire result set fits in memory it would be nice to use the
O(n) Quickselect algorithm -- which would only be a small change to
the existing Quicksort code -- instead of sorting the entire set. That
should be possible both for percentile() and median(). It would be
good to generalize the tuplesort api sufficiently so that you can
implement this as a contrib module even if we eventually decide to
integrate it in core. It's probably worth having two copies of the
sort code, one for Quickselect and one for Quicksort just for speed,
though I suppose it's worth benchmarking.
But I'm not aware of a generalization of tapesort to allow O(n)
selection with the sequential i/o properties of tapesort. It would be
especially interesting, probably more so than the in-memory case.
--
greg
Greg Stark <gsstark@mit.edu> writes:
On Thu, Aug 19, 2010 at 11:59 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
I am sending a prototype implementation of functions median and
percentile. This implementation is very simple and I moved it to
contrib for this moment - it is more easy maintainable. Later I'll
move it to core.
So if the entire result set fits in memory it would be nice to use the
O(n) Quickselect algorithm -- which would only be a small change to
the existing Quicksort code -- instead of sorting the entire set.
That seems like rather a lot of added infrastructure for functions whose
popularity is at best unknown. I think we should KISS for the first
implementation.
regards, tom lane
On Thu, Aug 19, 2010 at 11:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Greg Stark <gsstark@mit.edu> writes:
On Thu, Aug 19, 2010 at 11:59 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
I am sending a prototype implementation of functions median and
percentile. This implementation is very simple and I moved it to
contrib for this moment - it is more easy maintainable. Later I'll
move it to core.So if the entire result set fits in memory it would be nice to use the
O(n) Quickselect algorithm -- which would only be a small change to
the existing Quicksort code -- instead of sorting the entire set.That seems like rather a lot of added infrastructure for functions whose
popularity is at best unknown. I think we should KISS for the first
implementation.
+1. I think the functions are useful, but the perfect is the enemy of the good.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
On Thu, Aug 19, 2010 at 12:49:45PM -0400, Robert Haas wrote:
On Thu, Aug 19, 2010 at 11:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Greg Stark <gsstark@mit.edu> writes:
On Thu, Aug 19, 2010 at 11:59 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
I am sending a prototype implementation of functions median and
percentile. This implementation is very simple and I moved it to
contrib for this moment - it is more easy maintainable. Later I'll
move it to core.So if the entire result set fits in memory it would be nice to use the
O(n) Quickselect algorithm -- which would only be a small change to
the existing Quicksort code -- instead of sorting the entire set.That seems like rather a lot of added infrastructure for functions whose
popularity is at best unknown. �I think we should KISS for the first
implementation.+1. I think the functions are useful, but the perfect is the enemy of the good.
Percentile is already there as NTILE, a windowing function. Median
may be useful, but we pretty much can't just call it "median."
Instead, we need to call it something like "left_median" or
"arithmetic_median."
Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
David Fetter <david@fetter.org> wrote:
Median may be useful, but we pretty much can't just call it
"median." Instead, we need to call it something like "left_median"
or "arithmetic_median."
I think it would be reasonable, and perhaps preferable, to use just
"median" for the semantics described in most dictionaries -- for
example, this:
http://www.merriam-webster.com/dictionary/median
If you do a google search for "median" and poke around, you'll find
many places where this is the only definition mentioned; the others
seem to be rather infrequently used. Why not make the commone usage
convenient?
-Kevin
On Thu, Aug 19, 2010 at 12:12:12PM -0500, Kevin Grittner wrote:
David Fetter <david@fetter.org> wrote:
Median may be useful, but we pretty much can't just call it
"median." Instead, we need to call it something like "left_median"
or "arithmetic_median."I think it would be reasonable, and perhaps preferable, to use just
"median" for the semantics described in most dictionaries -- for
example, this:http://www.merriam-webster.com/dictionary/median
If you do a google search for "median" and poke around, you'll find
many places where this is the only definition mentioned; the others
seem to be rather infrequently used. Why not make the commone usage
convenient?
The reason not to is the same reason that MEDIAN doesn't appear in the
SQL standard, namely that what's common in one field is wrong in
another.
Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
David Fetter <david@fetter.org> writes:
On Thu, Aug 19, 2010 at 12:12:12PM -0500, Kevin Grittner wrote:
http://www.merriam-webster.com/dictionary/median
If you do a google search for "median" and poke around, you'll find
many places where this is the only definition mentioned; the others
seem to be rather infrequently used. Why not make the commone usage
convenient?
The reason not to is the same reason that MEDIAN doesn't appear in the
SQL standard, namely that what's common in one field is wrong in
another.
Hmm, do you have any *evidence* that that's why it's not in the standard?
My own take on that is that it's reasonably probable that the SQL
committee might standardize a function by that name someday. What we
need to be worrying about is the prospect that if there are multiple
definitions for the term, they might pick a different one than we did.
A name like "arithmetic_median" seems much less likely to get blindsided
by future standardization.
regards, tom lane
2010/8/19 David Fetter <david@fetter.org>:
On Thu, Aug 19, 2010 at 12:49:45PM -0400, Robert Haas wrote:
On Thu, Aug 19, 2010 at 11:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Greg Stark <gsstark@mit.edu> writes:
On Thu, Aug 19, 2010 at 11:59 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
I am sending a prototype implementation of functions median and
percentile. This implementation is very simple and I moved it to
contrib for this moment - it is more easy maintainable. Later I'll
move it to core.So if the entire result set fits in memory it would be nice to use the
O(n) Quickselect algorithm -- which would only be a small change to
the existing Quicksort code -- instead of sorting the entire set.That seems like rather a lot of added infrastructure for functions whose
popularity is at best unknown. I think we should KISS for the first
implementation.+1. I think the functions are useful, but the perfect is the enemy of the good.
Percentile is already there as NTILE, a windowing function. Median
I don't think it. The purpose is same, but usage is different -
aggregate functions can be used as window func, but window functions
cannot be used as aggregates.
may be useful, but we pretty much can't just call it "median."
Instead, we need to call it something like "left_median" or
"arithmetic_median."
I put some time to deep searching in this area - and I talked with my
kolegues mathematican and everywhere use a just median. Some longer or
different name isn't barrier for me, but people can be confused.
Regards
Pavel
Show quoted text
Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.icsRemember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
2010/8/19 David Fetter <david@fetter.org>:
On Thu, Aug 19, 2010 at 12:12:12PM -0500, Kevin Grittner wrote:
David Fetter <david@fetter.org> wrote:
Median may be useful, but we pretty much can't just call it
"median." Instead, we need to call it something like "left_median"
or "arithmetic_median."I think it would be reasonable, and perhaps preferable, to use just
"median" for the semantics described in most dictionaries -- for
example, this:http://www.merriam-webster.com/dictionary/median
If you do a google search for "median" and poke around, you'll find
many places where this is the only definition mentioned; the others
seem to be rather infrequently used. Why not make the commone usage
convenient?The reason not to is the same reason that MEDIAN doesn't appear in the
SQL standard, namely that what's common in one field is wrong in
another.
I think some else. The reason can be more simple - implementation of
median is significantly harder then other aggregates.
I looked there and Oracle11g use "median" in common sense.
Regards
Pavel Stehule
Show quoted text
Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.icsRemember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Thu, Aug 19, 2010 at 01:25:36PM -0400, Tom Lane wrote:
David Fetter <david@fetter.org> writes:
On Thu, Aug 19, 2010 at 12:12:12PM -0500, Kevin Grittner wrote:
http://www.merriam-webster.com/dictionary/median
If you do a google search for "median" and poke around, you'll find
many places where this is the only definition mentioned; the others
seem to be rather infrequently used. Why not make the commone usage
convenient?The reason not to is the same reason that MEDIAN doesn't appear in the
SQL standard, namely that what's common in one field is wrong in
another.Hmm, do you have any *evidence* that that's why it's not in the standard?
Only from Joe Celko, who was on the standards committee, and has
written on the subject. There's nothing I've found in the standard
itself for this, if that's what you're looking for.
My own take on that is that it's reasonably probable that the SQL
committee might standardize a function by that name someday. What
we need to be worrying about is the prospect that if there are
multiple definitions for the term, they might pick a different one
than we did.
Exactly :)
A name like "arithmetic_median" seems much less likely to get
blindsided by future standardization.
Yep.
Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Pavel Stehule <pavel.stehule@gmail.com> wrote:
I looked there and Oracle11g use "median" in common sense.
As does Sybase IQ. FWIW, Excel spreadsheets do, too.
The chance of the SQL committee picking some other meaning for bare
MEDIAN seems vanishingly small; although I have to grant that with
only Oracle and Sybase as a precedent in the SQL world, it's not
zero.
-Kevin
David Fetter <david@fetter.org> writes:
On Thu, Aug 19, 2010 at 01:25:36PM -0400, Tom Lane wrote:
A name like "arithmetic_median" seems much less likely to get
blindsided by future standardization.
Yep.
OTOH, if Pavel's right that Oracle already has an aggregate named
median(), it seems quite unlikely that the SQL committee would pick
a definition incompatible with Oracle's to standardize on.
regards, tom lane
2010/8/19 Pavel Stehule <pavel.stehule@gmail.com>:
Hello
I am sending a prototype implementation of functions median and
percentile. This implementation is very simple and I moved it to
contrib for this moment - it is more easy maintainable. Later I'll
move it to core.
I've reviewed this patch.
* The patch can apply cleanly and make doesn't print any errors nor
warnings. But it doesn't touch contrib/Makefile so I had to make by
changing dir to contrib/median.
* Cosmetic coding style should be more appropriate, including trailing
white spaces and indentation positions.
* Since these two aggregates use tuplesort inside it, there're
possible risk to cause out of memory in normal use case. I don't think
this fact is critical, but at least some notation should be referred
in docs.
* It doesn't contain any document nor regression tests.
* They should be callable in window function context; for example
contrib_regression=# select median(a) over (order by a) from t1;
ERROR: invalid tuplesort state
or at least user-friend message should be printed.
* The returned type is controversy. median(int) returns float8 as the
patch intended, but avg(int) returns numeric. AFAIK only avg(float8)
returns float8.
* percentile() is more problematic; first, the second argument for the
aggregate takes N of "N%ile" as int, like 50 if you want 50%ile, but
it doesn't care about negative values or more than 100. In addition,
the second argument is taken at the first non-NULL value of the first
argument, but the second argument is semantically constant. For
example, for (1.. 10) value of a in table t1,
contrib_regression=# select percentile(a, a * 10 order by a) from t1;
percentile
------------
1
(1 row)
contrib_regression=# select percentile(a, a * 10 order by a desc) from t1;
percentile
------------
10
(1 row)
and if the argument comes from the subquery which doesn't contain
ORDER BY clause, you cannot predict the result.
That's all of my quick review up to now.
Regards,
--
Hitoshi Harada
On Mon, Sep 20, 2010 at 12:47:09PM +0900, Hitoshi Harada wrote:
2010/8/19 Pavel Stehule <pavel.stehule@gmail.com>:
Hello
I am sending a prototype implementation of functions median and
percentile. This implementation is very simple and I moved it to
contrib for this moment - it is more easy maintainable. Later I'll
move it to core.I've reviewed this patch.
[review elided]
Pavel,
Will you have time to get this cleaned up this week per Hitoshi's
review?
If not, I'll mark it "returned with feedback," and move it to
the next Commitfest.
Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
2010/9/20 David Fetter <david@fetter.org>:
On Mon, Sep 20, 2010 at 12:47:09PM +0900, Hitoshi Harada wrote:
2010/8/19 Pavel Stehule <pavel.stehule@gmail.com>:
Hello
I am sending a prototype implementation of functions median and
percentile. This implementation is very simple and I moved it to
contrib for this moment - it is more easy maintainable. Later I'll
move it to core.I've reviewed this patch.
[review elided]
Pavel,
Will you have time to get this cleaned up this week per Hitoshi's
review?If not, I'll mark it "returned with feedback," and move it to
the next Commitfest.
I hope, so I'll some time this week. Not sure about percentille issues
- but minimally implementation of median can be prepared.
Regards
Pavel
Show quoted text
Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.icsRemember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Hello
I found probably hard problem in cooperation with window functions :(
tuplesort_begin_datum creates child context inside aggcontext. This
context is used for tuplestore. But when this function is called from
WindowAgg_Aggregates context - someone drops all child context every
iteration, and then tuplestore state is invalid. For this moment we
can block using a median function as window function, but it should be
solved better - if it is possible - I don't see inside window function
implementation.
2010/9/20 Hitoshi Harada <umi.tanuki@gmail.com>:
2010/8/19 Pavel Stehule <pavel.stehule@gmail.com>:
Hello
I am sending a prototype implementation of functions median and
percentile. This implementation is very simple and I moved it to
contrib for this moment - it is more easy maintainable. Later I'll
move it to core.I've reviewed this patch.
* The patch can apply cleanly and make doesn't print any errors nor
warnings. But it doesn't touch contrib/Makefile so I had to make by
changing dir to contrib/median.
fixed
* Cosmetic coding style should be more appropriate, including trailing
white spaces and indentation positions.
can you specify where, please?
* Since these two aggregates use tuplesort inside it, there're
possible risk to cause out of memory in normal use case. I don't think
this fact is critical, but at least some notation should be referred
in docs.
it is same as windows function, no? So the risk is same.
* It doesn't contain any document nor regression tests.
yes, it is my fault, some median regress test appended, documentation
I am will sending later - resp. if you agree with current form of
patch, I'll finalize this patch and remove "median" to core. I put
these functions to contrib just for simple testing of proof concept.
* They should be callable in window function context; for example
contrib_regression=# select median(a) over (order by a) from t1;
ERROR: invalid tuplesort stateor at least user-friend message should be printed.
the problem is in windows function implementation - partially fixed -
blocked from windows context
* The returned type is controversy. median(int) returns float8 as the
patch intended, but avg(int) returns numeric. AFAIK only avg(float8)
returns float8.
fixed
* percentile() is more problematic; first, the second argument for the
aggregate takes N of "N%ile" as int, like 50 if you want 50%ile, but
it doesn't care about negative values or more than 100. In addition,
the second argument is taken at the first non-NULL value of the first
argument, but the second argument is semantically constant. For
example, for (1.. 10) value of a in table t1,
yes, I understand - I don't have a problem to modify it. Just this has
same behave as string_agg. This is again deeper problem - we cannot
ensure so some parameter of aggregation function will be a constant. -
so it cannot be well solved now :( Have you some idea? There isn't any
tool for checking if some parameter is constant or not.
I removed percentile now.
Show quoted text
contrib_regression=# select percentile(a, a * 10 order by a) from t1;
percentile
------------
1
(1 row)contrib_regression=# select percentile(a, a * 10 order by a desc) from t1;
percentile
------------
10
(1 row)and if the argument comes from the subquery which doesn't contain
ORDER BY clause, you cannot predict the result.That's all of my quick review up to now.
Regards,
--
Hitoshi Harada
Attachments:
median.difftext/x-patch; charset=US-ASCII; name=median.diffDownload+426-0
On Tue, Sep 21, 2010 at 4:28 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
* Cosmetic coding style should be more appropriate, including trailing
white spaces and indentation positions.can you specify where, please?
I noticed a lot of problems in this area when working on your \ef /
\sf patch, too. Trailing whitespace is nearly always bad, and it's
not hard to find - just grep the diff for it. As for indentation, try
to match the surrounding code.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
2010/9/21 Robert Haas <robertmhaas@gmail.com>:
On Tue, Sep 21, 2010 at 4:28 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
* Cosmetic coding style should be more appropriate, including trailing
white spaces and indentation positions.can you specify where, please?
I noticed a lot of problems in this area when working on your \ef /
\sf patch, too. Trailing whitespace is nearly always bad, and it's
not hard to find - just grep the diff for it. As for indentation, try
to match the surrounding code.
is updated patch better?
Pavel
Show quoted text
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Attachments:
median.difftext/x-patch; charset=US-ASCII; name=median.diffDownload+421-0
2010/9/22 Pavel Stehule <pavel.stehule@gmail.com>:
2010/9/21 Robert Haas <robertmhaas@gmail.com>:
On Tue, Sep 21, 2010 at 4:28 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
* Cosmetic coding style should be more appropriate, including trailing
white spaces and indentation positions.can you specify where, please?
I noticed a lot of problems in this area when working on your \ef /
\sf patch, too. Trailing whitespace is nearly always bad, and it's
not hard to find - just grep the diff for it. As for indentation, try
to match the surrounding code.is updated patch better?
Better, but you should be more careful, for example,
+ tuplesort_performsort(aggstate->sortstate);
+ <-- 1
+ while (tuplesort_getdatum(aggstate->sortstate,
+ true,
+ &value, &isNull))
For #1, please remove horizontal tabs in empty lines.
Regards,
--
Hitoshi Harada