Built-in binning functions

Started by Petr Jelinekalmost 12 years ago32 messageshackers

petr@2ndquadrant.com

almost 12 years ago

Hello,

here is a patch implementing varwidth_bucket (naming is up for
discussion) function which does binning with variable bucket width. The
use-cases are same as for width_bucket (=data analytics, mainly
histograms), the difference is that width_bucket uses buckets of same
width but the varwidth_bucket accepts an sorted array of right-bound
thresholds to define the individual buckets.

Currently applications implement this with long CASE statements which
are quite hard to read/maintain and are much slower than this
implementation which uses binary search.

There are 3 actual functions, one generic and two faster versions for
the int8 and float8 input that take advantage of the static width of
those types.

The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under grant
agreement nï¿½ 318633

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Robert Haas

robertmhaas@gmail.com

almost 12 years ago

In reply to: Petr Jelinek (#1)

Re: Built-in binning functions

On Fri, Jun 13, 2014 at 8:22 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

here is a patch implementing varwidth_bucket (naming is up for discussion)
function which does binning with variable bucket width. The use-cases are
same as for width_bucket (=data analytics, mainly histograms), the
difference is that width_bucket uses buckets of same width but the
varwidth_bucket accepts an sorted array of right-bound thresholds to define
the individual buckets.

Currently applications implement this with long CASE statements which are
quite hard to read/maintain and are much slower than this implementation
which uses binary search.

There are 3 actual functions, one generic and two faster versions for the
int8 and float8 input that take advantage of the static width of those
types.

I wonder if stuff like this shouldn't live in contrib rather than
core, but I guess it's probably not worth it for 3 functions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Petr Jelinek

petr@2ndquadrant.com

almost 12 years ago

In reply to: Robert Haas (#2)

Re: Built-in binning functions

On 17/06/14 16:15, Robert Haas wrote:

On Fri, Jun 13, 2014 at 8:22 PM, Petr Jelinek <petr@2ndquadrant.com> wrote:

here is a patch implementing varwidth_bucket (naming is up for discussion)
function which does binning with variable bucket width. The use-cases are
same as for width_bucket (=data analytics, mainly histograms), the
difference is that width_bucket uses buckets of same width but the
varwidth_bucket accepts an sorted array of right-bound thresholds to define
the individual buckets.

I wonder if stuff like this shouldn't live in contrib rather than
core, but I guess it's probably not worth it for 3 functions.

I was wondering the same and I also think that 3 simple functions is not
that much, plus it seems natural extension of the existing width_bucket.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

almost 12 years ago

In reply to: Petr Jelinek (#1)

Re: Built-in binning functions

Petr Jelinek <petr@2ndquadrant.com> writes:

here is a patch implementing varwidth_bucket (naming is up for
discussion) function which does binning with variable bucket width.

I didn't see any discussion of the naming question in this thread.
I'd like to propose that it should be just "width_bucket()"; we can
easily determine which function is meant, considering that the
SQL-spec variants don't take arrays and don't even have the same
number of actual arguments.

Also, the set of functions provided still needs more thought, because the
resolution of overloaded functions doesn't really work as nicely as all
that. I illustrate:

regression=# create function myf(int8) returns int as 'select 0' language sql;
CREATE FUNCTION
regression=# create function myf(float8) returns int as 'select 1' language sql;
CREATE FUNCTION
regression=# create function myf(anyelement) returns int as 'select 2' language sql;
CREATE FUNCTION
regression=# select myf(1);
myf
-----
1
(1 row)

So given plain integer arguments, we'll invoke the float8 version not the
int8 version, which may be undesirable. The same for int2 arguments.
We could fix that by adding bespoke int4 and maybe int2 variants, but
TBH, I'm not sure that the specific-type functions are worth the trouble.
Maybe we should just have one generic function, and take the trouble to
optimize its array-subscripting calculations for either fixed-length or
variable-length array elements? Have you got performance measurements
demonstrating that multiple implementations really buy enough to justify
the extra code?

Also, I'm not convinced by this business of throwing an error for a
NULL array element. Per spec, null arguments to width_bucket()
produce a null result, not an error --- shouldn't this flavor act
similarly? In any case I think the test needs to use
array_contains_nulls() not just ARR_HASNULL.

Lastly, the spec defines behaviors for width_bucket that allow either
ascending or descending buckets. We could possibly do something similar
in these functions by initially comparing the two endpoint elements to see
which one is larger, and then reversing the sense of all the comparisons
if the first one is larger. I'm not sure if that's worth the trouble or
not, but if the SQL committee thought descending bucket numbering was
worthwhile, maybe it's worthwhile here too.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Petr Jelinek

petr@2ndquadrant.com

almost 12 years ago

In reply to: Tom Lane (#4)

Re: Built-in binning functions

On 08/07/14 02:14, Tom Lane wrote:

Petr Jelinek <petr@2ndquadrant.com> writes:

here is a patch implementing varwidth_bucket (naming is up for
discussion) function which does binning with variable bucket width.

I didn't see any discussion of the naming question in this thread.
I'd like to propose that it should be just "width_bucket()"; we can
easily determine which function is meant, considering that the
SQL-spec variants don't take arrays and don't even have the same
number of actual arguments.

I did mention in submission that the names are up for discussion, I am
all for naming it just width_bucket.

So given plain integer arguments, we'll invoke the float8 version not the
int8 version, which may be undesirable. The same for int2 arguments.
We could fix that by adding bespoke int4 and maybe int2 variants, but

Hmm, yeah I don't love the idea of having 3 functions just to cover
integer datatypes :(.

TBH, I'm not sure that the specific-type functions are worth the trouble.
Maybe we should just have one generic function, and take the trouble to
optimize its array-subscripting calculations for either fixed-length or
variable-length array elements? Have you got performance measurements
demonstrating that multiple implementations really buy enough to justify
the extra code?

The performance difference is about 20% (+/- few depending on the array
size), I don't know if that's bad enough to warrant type-specific
implementation. I personally don't know how to make the generic
implementation much faster than it is now, except maybe by turning it
into aggregate which would cache the deconstructed version of the array,
but that change semantics quite a bit and is probably not all that
desirable.

Also, I'm not convinced by this business of throwing an error for a
NULL array element. Per spec, null arguments to width_bucket()
produce a null result, not an error --- shouldn't this flavor act
similarly? In any case I think the test needs to use
array_contains_nulls() not just ARR_HASNULL.

I am not against returning NULL for NULL array, I would still like to
error on array that has values + NULL somewhere in the middle though.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Pavel Stehule

pavel.stehule@gmail.com

almost 12 years ago

In reply to: Petr Jelinek (#5)

Re: Built-in binning functions

2014-07-16 10:04 GMT+02:00 Petr Jelinek <petr@2ndquadrant.com>:

On 08/07/14 02:14, Tom Lane wrote:

Petr Jelinek <petr@2ndquadrant.com> writes:

here is a patch implementing varwidth_bucket (naming is up for
discussion) function which does binning with variable bucket width.

I didn't see any discussion of the naming question in this thread.
I'd like to propose that it should be just "width_bucket()"; we can
easily determine which function is meant, considering that the
SQL-spec variants don't take arrays and don't even have the same
number of actual arguments.

I did mention in submission that the names are up for discussion, I am all
for naming it just width_bucket.

I had this idea too - but I am not sure if it is good idea. A distance
between ANSI SQL with_bucket and our enhancing is larger than in our
implementation of "median" for example.

I can live with both names, but current name I prefer.

So given plain integer arguments, we'll invoke the float8 version not the
int8 version, which may be undesirable. The same for int2 arguments.
We could fix that by adding bespoke int4 and maybe int2 variants, but

Hmm, yeah I don't love the idea of having 3 functions just to cover
integer datatypes :(.

TBH, I'm not sure that the specific-type functions are worth the trouble.

Maybe we should just have one generic function, and take the trouble to
optimize its array-subscripting calculations for either fixed-length or
variable-length array elements? Have you got performance measurements
demonstrating that multiple implementations really buy enough to justify
the extra code?

The performance difference is about 20% (+/- few depending on the array
size), I don't know if that's bad enough to warrant type-specific
implementation. I personally don't know how to make the generic
implementation much faster than it is now, except maybe by turning it into
aggregate which would cache the deconstructed version of the array, but
that change semantics quite a bit and is probably not all that desirable.

I am not sure if our API is enough to do it - there are no any simple
support for immutable parameters.

The performance is one point. Second point is wrong result, when input
array is not well sorted. But this check means next performance
penalization. But we cannot do it.

Also, I'm not convinced by this business of throwing an error for a
NULL array element. Per spec, null arguments to width_bucket()
produce a null result, not an error --- shouldn't this flavor act
similarly? In any case I think the test needs to use
array_contains_nulls() not just ARR_HASNULL.

I am not against returning NULL for NULL array, I would still like to
error on array that has values + NULL somewhere in the middle though.

Pavel

Show quoted text

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Petr Jelinek

petr@2ndquadrant.com

almost 12 years ago

In reply to: Pavel Stehule (#6)

Re: Built-in binning functions

On 16/07/14 21:35, Pavel Stehule wrote:

The performance difference is about 20% (+/- few depending on the
array size), I don't know if that's bad enough to warrant
type-specific implementation. I personally don't know how to make
the generic implementation much faster than it is now, except maybe
by turning it into aggregate which would cache the deconstructed
version of the array, but that change semantics quite a bit and is
probably not all that desirable.

I am not sure if our API is enough to do it - there are no any simple
support for immutable parameters.

Just to clarify, the ~20% performance difference is with separate
generic implementation for fixed width types (most of the time seems to
be spent in the FunctionCallInvoke part, I even tryed to use sortsupport
instead but it does not seem to help much).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Petr Jelinek

petr@2ndquadrant.com

almost 12 years ago

In reply to: Tom Lane (#4)

Re: Built-in binning functions

On 08/07/14 02:14, Tom Lane wrote:

Also, the set of functions provided still needs more thought, because the
resolution of overloaded functions doesn't really work as nicely as all
that.

I am honestly considering just removing special case for int8 for now,
the sql standard width_bucket has only support for float8 and numeric,
maybe it's enough for this one also...

What's your opinion on that?

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Simon Riggs

simon@2ndQuadrant.com

almost 12 years ago

In reply to: Pavel Stehule (#6)

Re: Built-in binning functions

On 16 July 2014 20:35, Pavel Stehule <pavel.stehule@gmail.com> wrote:

2014-07-16 10:04 GMT+02:00 Petr Jelinek <petr@2ndquadrant.com>:

On 08/07/14 02:14, Tom Lane wrote:

Petr Jelinek <petr@2ndquadrant.com> writes:

here is a patch implementing varwidth_bucket (naming is up for
discussion) function which does binning with variable bucket width.

I didn't see any discussion of the naming question in this thread.
I'd like to propose that it should be just "width_bucket()"; we can
easily determine which function is meant, considering that the
SQL-spec variants don't take arrays and don't even have the same
number of actual arguments.

I did mention in submission that the names are up for discussion, I am all
for naming it just width_bucket.

I had this idea too - but I am not sure if it is good idea. A distance
between ANSI SQL with_bucket and our enhancing is larger than in our
implementation of "median" for example.

I can live with both names, but current name I prefer.

Hmmm, not sure. Let's look around and think what words people use.

Transforms of this type are referred to as discretization in formal
literature and as binning in commong usage/statistical software.

width_bucket() seems to refer to an equal-width binning process. The
function being discussed here is a generic mechanism, the boundaries
of which could have been decided using equal-frequency or other
mechanisms. Using the word "width" in those contexts could be
confusing.

Given width_bucket() is already in use for SQL Standard function, I
agree it would be confusing to have same name for different parameter
profiles.

So I suggest that we use the more generic function name bin(), with a
second choice of discretize() (though that seems annoyingly easy to
spell incorrectly)

Whatever we do, it seems clear we need a section in the manual to
describe Statistical Functions, including width_bucket(), whatever we
call this function and including the terms bin, binning, transform,
discretize and discretization to ensure those are easily searchable.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

Petr Jelinek

petr@2ndquadrant.com

almost 12 years ago

In reply to: Simon Riggs (#9)

Re: Built-in binning functions

Hi,

I finally had some time to get back to this.

I attached version3 of the patch which "fixes" Tom's complaint about
int8 version by removing the int8 version as it does not seem necessary
(the float8 can handle integers just fine).

This patch now basically has just one optimized function for float8 and
one for numeric datatypes, just like width_bucket.

On 08/07/14 02:14, Tom Lane wrote:
Also, I'm not convinced by this business of throwing an error for a
NULL array element. Per spec, null arguments to width_bucket()
produce a null result, not an error --- shouldn't this flavor act
similarly? In any case I think the test needs to use
array_contains_nulls() not just ARR_HASNULL.

I really think there should be difference between NULL array and NULL
inside array and that NULL inside the array is wrong input. I changed
the check to array_contains_nulls() though.

On 08/07/14 02:14, Tom Lane wrote:
Lastly, the spec defines behaviors for width_bucket that allow either
ascending or descending buckets. We could possibly do something
similar

I am not sure it's worth it here as we require input to be sorted
anyway. It might be worthwhile if we decided to do this as an aggregate
(since there input would not be required to be presorted) instead of
function but I am not sure if that's desirable either as that would
limit usability to only the single main use-case (group by and count()).

On 20/07/14 11:01, Simon Riggs wrote:

On 16 July 2014 20:35, Pavel Stehule <pavel.stehule@gmail.com> wrote:

On 08/07/14 02:14, Tom Lane wrote:

I didn't see any discussion of the naming question in this thread.
I'd like to propose that it should be just "width_bucket()"; we can
easily determine which function is meant, considering that the
SQL-spec variants don't take arrays and don't even have the same
number of actual arguments.

I did mention in submission that the names are up for discussion, I am all
for naming it just width_bucket.

I had this idea too - but I am not sure if it is good idea. A distance
between ANSI SQL with_bucket and our enhancing is larger than in our
implementation of "median" for example.

I can live with both names, but current name I prefer.

So I suggest that we use the more generic function name bin(), with a
second choice of discretize() (though that seems annoyingly easy to
spell incorrectly)

I really don't think bin() is good choice here, bucket has same meaning
in this context and bin is often used as shorthand for binary which
would be confusing here.

Anyway I currently left the name as it is, I would not be against using
width_bucket() or even just bucket(), not sure about the discretize()
option.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#11

Pavel Stehule

pavel.stehule@gmail.com

almost 12 years ago

In reply to: Petr Jelinek (#10)

Re: Built-in binning functions

I am looking to your last patch and I have a few questions, notes

1. I am thinking so reduction to only numeric types is not necessary -
although we can live without it - but there are lot of non numeric
categories: chars, date, ...

2. Still I strongly afraid about used searching method - there is not
possible to check a validity of input. Did you check how much linear
searching is slower - you spoke, so you have a real data and real use case?
Used methods can returns wrong result without any warning, what is
acceptable for extensions, but I am not sure, for core feature.

3. I am thinking about name - I don't think so varwidth_bucket is correct
-- in relation to name "width_bucket" ... what about "range_bucket" or
"scope_bucket" ?

last patch is very simple, there are no new compilation or regress tests
issues.

Regards

Pavel

2014-08-25 16:15 GMT+02:00 Petr Jelinek <petr@2ndquadrant.com>:

Show quoted text

Hi,

I finally had some time to get back to this.

I attached version3 of the patch which "fixes" Tom's complaint about int8
version by removing the int8 version as it does not seem necessary (the
float8 can handle integers just fine).

This patch now basically has just one optimized function for float8 and
one for numeric datatypes, just like width_bucket.

On 08/07/14 02:14, Tom Lane wrote:

Also, I'm not convinced by this business of throwing an error for a
NULL array element. Per spec, null arguments to width_bucket()
produce a null result, not an error --- shouldn't this flavor act
similarly? In any case I think the test needs to use
array_contains_nulls() not just ARR_HASNULL.

I really think there should be difference between NULL array and NULL
inside array and that NULL inside the array is wrong input. I changed the
check to array_contains_nulls() though.

On 08/07/14 02:14, Tom Lane wrote:

Lastly, the spec defines behaviors for width_bucket that allow either
ascending or descending buckets. We could possibly do something
similar

I am not sure it's worth it here as we require input to be sorted anyway.
It might be worthwhile if we decided to do this as an aggregate (since
there input would not be required to be presorted) instead of function but
I am not sure if that's desirable either as that would limit usability to
only the single main use-case (group by and count()).

On 20/07/14 11:01, Simon Riggs wrote:

On 16 July 2014 20:35, Pavel Stehule <pavel.stehule@gmail.com> wrote:

On 08/07/14 02:14, Tom Lane wrote:

I didn't see any discussion of the naming question in this thread.
I'd like to propose that it should be just "width_bucket()"; we can
easily determine which function is meant, considering that the
SQL-spec variants don't take arrays and don't even have the same
number of actual arguments.

I did mention in submission that the names are up for discussion, I am
all
for naming it just width_bucket.

I had this idea too - but I am not sure if it is good idea. A distance
between ANSI SQL with_bucket and our enhancing is larger than in our
implementation of "median" for example.

I can live with both names, but current name I prefer.

So I suggest that we use the more generic function name bin(), with a
second choice of discretize() (though that seems annoyingly easy to
spell incorrectly)

I really don't think bin() is good choice here, bucket has same meaning in
this context and bin is often used as shorthand for binary which would be
confusing here.

Anyway I currently left the name as it is, I would not be against using
width_bucket() or even just bucket(), not sure about the discretize()
option.

--
Petr Jelinek http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Training & Services

#12

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Pavel Stehule (#11)

Re: Built-in binning functions

Pavel Stehule <pavel.stehule@gmail.com> writes:

1. I am thinking so reduction to only numeric types is not necessary -
although we can live without it - but there are lot of non numeric
categories: chars, date, ...

I wasn't terribly happy about that either. I still think we should
reduce this to a single polymorphic function, as in the attached.

2. Still I strongly afraid about used searching method - there is not
possible to check a validity of input. Did you check how much linear
searching is slower - you spoke, so you have a real data and real use case?

Well, if we check the input then we'll be doing O(N) comparisons instead
of O(log N). That could be a serious cost penalty for large arrays and
expensive comparison functions (such as strcoll()). I think it's probably
sufficient to have a clear warning in the docs.

3. I am thinking about name - I don't think so varwidth_bucket is correct
-- in relation to name "width_bucket" ... what about "range_bucket" or
"scope_bucket" ?

Neither of those seem like improvements from here. I agree with the
objections to bin() as well. bucket() might be all right but it still
seems a bit too generic. width_bucket(), which at least shows there's
a relationship to the standard functions, still seems like the best
of the suggestions so far.

It occurs to me that there would be an advantage to using some other
name besides width_bucket: we could then mark the function as variadic,
which might be a notational advantage in some usages. (We can't do
that if we call it width_bucket because the four-parameter case would
be ambiguous with the existing functions.) I'm not sure that this is
important enough to justify changing the name though, especially if
we can't come up with a good name. Also, doing that would put a very
large premium on picking a non-generic name, else we'd risk creating
ambiguities against user-defined functions.

Also, a documentation quibble: in Petr's patch the documentation and
comments refer to the thresholds as "right bounds", which I didn't care
for and replaced with "upper bounds". However, looking at it again,
I wonder if we would not be better off describing them as "lower bounds".
They are exclusive bounds if seen as upper bounds, and inclusive if seen
as lower bounds, and on the whole I think the latter viewpoint might be
less confusing. Thoughts?

Proposed patch with 1 polymorphic function attached.

regards, tom lane

#13

Simon Riggs

simon@2ndQuadrant.com

over 11 years ago

In reply to: Tom Lane (#12)

Re: Built-in binning functions

On 30 August 2014 18:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:

3. I am thinking about name - I don't think so varwidth_bucket is correct
-- in relation to name "width_bucket" ... what about "range_bucket" or
"scope_bucket" ?

Neither of those seem like improvements from here. I agree with the
objections to bin() as well. bucket() might be all right but it still
seems a bit too generic. width_bucket(), which at least shows there's
a relationship to the standard functions, still seems like the best
of the suggestions so far.

I'd been mulling that over and agree with objections to bin()

Suggest discretize() as a much more informative name. The other names
will be overlooked by anybody that doesn't already know what to look
for.

Thanks for the polymorphic patch, that sounds good. Sorry, not
reviewed more deeply (still travelling).

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Simon Riggs (#13)

Re: Built-in binning functions

Simon Riggs <simon@2ndquadrant.com> writes:

Suggest discretize() as a much more informative name. The other names
will be overlooked by anybody that doesn't already know what to look
for.

I did not like that idea to begin with, but it's growing more attractive.
In particular, I think it would be unique enough that we could safely
mark the polymorphic function as variadic (a move that would cause
ambiguity issues for pretty nearly any user-defined function of the
same name).

OTOH, I'm not as convinced as you that this name would mean anything
to "anybody that doesn't already know what to look for". It still
seems like rather arcane terminology.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15

Gavin Flower

GavinFlower@archidevsys.co.nz

over 11 years ago

In reply to: Tom Lane (#14)

Re: Built-in binning functions

On 01/09/14 06:00, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

Suggest discretize() as a much more informative name. The other names
will be overlooked by anybody that doesn't already know what to look
for.

I did not like that idea to begin with, but it's growing more attractive.
In particular, I think it would be unique enough that we could safely
mark the polymorphic function as variadic (a move that would cause
ambiguity issues for pretty nearly any user-defined function of the
same name).

OTOH, I'm not as convinced as you that this name would mean anything
to "anybody that doesn't already know what to look for". It still
seems like rather arcane terminology.

regards, tom lane

If I was new to PostgreSQL, I'd probably read this as disc*(), a
function to do something with discs.

I have done finite maths and statistics at university, plus I have been
vaguely following this thread - but, the meaning of discretize() is not
immediately apparent to me (if I reread a previous email or 2 in this
thread, then I'm sure it would then be obvious!). I might guess that it
is to make something discrete, but why would I want to do that? So if I
came across this function in a year or 2, I'd probably go right past it.

I have been programming for over 40 years, and I still find choosing
appropriate names sometimes very difficult, so I know it is not easy to
come up with meaningful names - especially ones that mean something
relevant to other people!

Cheers,
Gavin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16

Petr Jelinek

petr@2ndquadrant.com

over 11 years ago

In reply to: Tom Lane (#12)

Re: Built-in binning functions

On 30/08/14 19:24, Tom Lane wrote:

Pavel Stehule <pavel.stehule@gmail.com> writes:

1. I am thinking so reduction to only numeric types is not necessary -
although we can live without it - but there are lot of non numeric
categories: chars, date, ...

I wasn't terribly happy about that either. I still think we should
reduce this to a single polymorphic function, as in the attached.

I did try to write generic function very similar to what you wrote but
discarded it because of the performance reasons. If we really want to
support any data type I am all for having generic function but I still
would like to have one optimized for float8 because that seems to be the
most used use-case (at least that one is the reason why I even wrote
this) for performance reasons.

Here are some numbers:
First float8:
CREATE TABLE wbtest AS SELECT (random() * 100)::float8 a FROM
generate_series(1,1000000) ORDER BY 1;

SELECT
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 90]::float8[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 91]::float8[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 92]::float8[])
FROM wbtest;

Optimized float8: ~250ms
Tom's generic: ~400ms

Numeric:
CREATE TABLE wbtestn AS SELECT (random() * 100)::numeric a FROM
generate_series(1,1000000) ORDER BY 1;

SELECT
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 90]::numeric[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 91]::numeric[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 92]::numeric[])
FROM wbtestn;

Optimized numeric: ~600ms
My generic: ~780ms
Tom's generic: ~900ms

The difference between my generic and Tom's generic is because Tom's is
slowed down by the deconstruct_array.

2. Still I strongly afraid about used searching method - there is not
possible to check a validity of input. Did you check how much linear
searching is slower - you spoke, so you have a real data and real use case?

Well, if we check the input then we'll be doing O(N) comparisons instead
of O(log N). That could be a serious cost penalty for large arrays and
expensive comparison functions (such as strcoll()). I think it's probably
sufficient to have a clear warning in the docs.

With resort the performance is worse than bunch of CASE WHEN :(

Also, a documentation quibble: in Petr's patch the documentation and
comments refer to the thresholds as "right bounds", which I didn't care
for and replaced with "upper bounds". However, looking at it again,
I wonder if we would not be better off describing them as "lower bounds".
They are exclusive bounds if seen as upper bounds, and inclusive if seen
as lower bounds, and on the whole I think the latter viewpoint might be
less confusing. Thoughts?

Upper bounds is probably better name than right bounds, I agree with
that. In any case it's upper bound for a bucket (that's why there is one
more bucket to which everything bigger than max threshold goes into).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Petr Jelinek (#16)

Re: Built-in binning functions

Petr Jelinek <petr@2ndquadrant.com> writes:

On 30/08/14 19:24, Tom Lane wrote:

I wasn't terribly happy about that either. I still think we should
reduce this to a single polymorphic function, as in the attached.

I did try to write generic function very similar to what you wrote but
discarded it because of the performance reasons. If we really want to
support any data type I am all for having generic function but I still
would like to have one optimized for float8 because that seems to be the
most used use-case (at least that one is the reason why I even wrote
this) for performance reasons.

Well, part of the reason why your v3 float8 function looks so fast is that
it's cheating: it will not produce the right answers for comparisons
involving NaNs. I'm not sure how good it would look once you'd added
some isnan() tests to make the behavior actually equivalent to
float8_cmp_internal().

However, assuming it still seems worthwhile once that's accounted for,
I don't have a strong objection to having an additional code path for
float8. There are two ways we could do that:

1. Add a runtime test in the polymorphic function, eg

if (element_type == FLOAT8OID)
result = width_bucket_float8(...);
else if (typentry->typlen > 0)
result = width_bucket_fixed(...);
else
result = width_bucket_variable(...);

2. Define a SQL function width_bucket(float8, float8[]) alongside
the polymorphic one.

While #2 might be marginally faster at runtime, the main difference
between these is what the parser would choose to do with ambiguous cases.

I experimented a bit and it seemed that the parser would prefer the
float8 function in any situation where the inputs were of numeric types,
probably because float8 is the preferred type in the numeric category.
I'm not real sure that would be a good thing: in particular it would
cast integer and numeric inputs to float8, which risks precision loss,
and there doesn't seem to be any way to get the parser to make the
other choice if the inputs are of numeric-category types.

As against that objection, it would make cross-type numeric cases easier
to use, for example "width_bucket(1, array[2.4])" would be accepted.
If we just offer the anyelement/anyarray version then the parser would
make you cast "1" to numeric before it would consider the function to
match.

On balance the runtime-test approach looks like a better idea, in that
it doesn't risk any unexpected semantic behaviors.

The difference between my generic and Tom's generic is because Tom's is
slowed down by the deconstruct_array.

Meh. It looked to me like your version would have O(N^2) performance
problems from computing array offsets repeatedly, depending on exactly
which array element it ended up on. It would avoid repeat calculations
only if it always moved right.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18

Simon Riggs

simon@2ndQuadrant.com

over 11 years ago

In reply to: Gavin Flower (#15)

Re: Built-in binning functions

On 31 August 2014 20:44, Gavin Flower <GavinFlower@archidevsys.co.nz> wrote:

On 01/09/14 06:00, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

Suggest discretize() as a much more informative name. The other names
will be overlooked by anybody that doesn't already know what to look
for.

I did not like that idea to begin with, but it's growing more attractive.
In particular, I think it would be unique enough that we could safely
mark the polymorphic function as variadic (a move that would cause
ambiguity issues for pretty nearly any user-defined function of the
same name).

OTOH, I'm not as convinced as you that this name would mean anything
to "anybody that doesn't already know what to look for". It still
seems like rather arcane terminology.

If I was new to PostgreSQL, I'd probably read this as disc*(), a function to
do something with discs.

I have done finite maths and statistics at university, plus I have been
vaguely following this thread - but, the meaning of discretize() is not
immediately apparent to me (if I reread a previous email or 2 in this
thread, then I'm sure it would then be obvious!). I might guess that it is
to make something discrete, but why would I want to do that? So if I came
across this function in a year or 2, I'd probably go right past it.

I have been programming for over 40 years, and I still find choosing
appropriate names sometimes very difficult, so I know it is not easy to come
up with meaningful names - especially ones that mean something relevant to
other people!

It's important that we *don't* come up with a new name.

This function does something useful and the words people already use
to describe that are binning and discretization. By this I mean names
already used by very widely used software. Search them and see, or
suggest more widely used alternatives, please.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19

Petr Jelinek

petr@2ndquadrant.com

over 11 years ago

In reply to: Tom Lane (#17)

Re: Built-in binning functions

On 31/08/14 22:33, Tom Lane wrote:

Petr Jelinek <petr@2ndquadrant.com> writes:

On 30/08/14 19:24, Tom Lane wrote:

I wasn't terribly happy about that either. I still think we should
reduce this to a single polymorphic function, as in the attached.

I did try to write generic function very similar to what you wrote but
discarded it because of the performance reasons. If we really want to
support any data type I am all for having generic function but I still
would like to have one optimized for float8 because that seems to be the
most used use-case (at least that one is the reason why I even wrote
this) for performance reasons.

Well, part of the reason why your v3 float8 function looks so fast is that
it's cheating: it will not produce the right answers for comparisons
involving NaNs. I'm not sure how good it would look once you'd added
some isnan() tests to make the behavior actually equivalent to
float8_cmp_internal().

True, it increases the time of the test to around 285ms, still almost
30% speed difference.

However, assuming it still seems worthwhile once that's accounted for,
I don't have a strong objection to having an additional code path for
float8. There are two ways we could do that:

1. Add a runtime test in the polymorphic function, eg

if (element_type == FLOAT8OID)
result = width_bucket_float8(...);
else if (typentry->typlen > 0)
result = width_bucket_fixed(...);
else
result = width_bucket_variable(...);

Yeah I think I prefer this one too, will see how much performance it eats.

The difference between my generic and Tom's generic is because Tom's is
slowed down by the deconstruct_array.

Meh. It looked to me like your version would have O(N^2) performance
problems from computing array offsets repeatedly, depending on exactly
which array element it ended up on. It would avoid repeat calculations
only if it always moved right.

I actually think that worst case (when you go always left) for my
version is O(N) since you only need to seek for the half of previous
interval (it's doing binary search after all) and you do O(N) in the
deconstruct_array. It would be very different if we could cache the
array somehow (ie, if this was an aggregate) then it would obviously
make a lot of sense to use deconstruct_array and in that case it would
make even sense to resort probably, but sadly we can't cache afaik.

Also, I made more tests with various array sizes (3-10000) and
distributions and mine version was never slower.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20

David G. Johnston

david.g.johnston@gmail.com

over 11 years ago

In reply to: Simon Riggs (#9)

Re: Built-in binning functions

Simon Riggs wrote

width_bucket() seems to refer to an equal-width binning process. The
function being discussed here is a generic mechanism, the boundaries
of which could have been decided using equal-frequency or other
mechanisms. Using the word "width" in those contexts could be
confusing.

Given width_bucket() is already in use for SQL Standard function, I
agree it would be confusing to have same name for different parameter
profiles.

literal_bucket(float8, float8[])

Since "bucket" is the 'verb' here (in this specific case meaning "lookup the
supplied value in the supplied bucket definition") and "width" is a modifier
(the bucket specification describes an equal-width structure) I suggest
"literal_bucket(val, array[])" such that the bucket is still the verb but
now the modifier describes a structure that is literally provided.

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Built-in-binning-functions-tp5807266p5817097.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: Petr Jelinek (#19)

#22

Tom Lane

tgl@sss.pgh.pa.us

over 11 years ago

In reply to: David G. Johnston (#20)

#23

David G. Johnston