Do we need multiple forms of the SQL2003 statistics aggregates?
Sergey's recent patch for the SQL2003 binary aggregates implements each
of them three times: in float4, float8, and numeric arithmetic. This
seems like vast overkill. The float4 versions certainly ought to go,
as they won't perform noticeably better than float8 and will likely be
subject to terrible roundoff-error problems. I'm also dubious about the
numeric versions. The spec says
    b) Let DTDVE be the declared type of DVE and let DTIVE be the
       declared type of IVE.
    c) Case:
       i) The declared type of REGR_COUNT is an implementation-defined
          exact numeric type with scale of 0 (zero).
       ii) Otherwise, the declared type of the result is an
           implementation-defined approximate numeric type. If DTDVE is an
           approximate numeric type, then the precision of the result is
           not less than the precision of DTDVE. If DTIVE is an approximate
           numeric type, then the precision of the result is not less than
           the precision of DTIVE.
so as far as I can see we are *only* required to provide a float8
implementation.
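The roundoff hazard with a float4 accumulator is easy to demonstrate. Here is a small Python sketch, purely for illustration (Python has no native float4, so it round-trips through IEEE 754 single precision via struct to stand in for one; this is not the actual aggregate code):

```python
import struct

def f32(x):
    # Round-trip through IEEE 754 single precision to emulate float4.
    return struct.unpack('f', struct.pack('f', x))[0]

# float4 has only a 24-bit significand (~7 decimal digits).  Once a
# running sum reaches 2**24, adding 1.0 to it is absorbed completely,
# so a float4 transition state silently drops small contributions.
acc = f32(2.0 ** 24)             # 16777216.0
assert f32(acc + 1.0) == acc     # the increment vanishes in float4

# float8 (Python's float, 53-bit significand) has no trouble here.
assert 2.0 ** 24 + 1.0 == 16777217.0
```

The same absorption happens in any long-running float4 sum whose total dwarfs the individual inputs, which is exactly the shape of an aggregate transition function.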
There is room to argue that the numeric-arithmetic version would be
worth having on the grounds of greater precision or range, but it's a
big chunk of code and the public demand for the functionality has not
exactly been overwhelming.
Comments?
regards, tom lane
I wrote:
There is room to argue that the numeric-arithmetic version would be
worth having on the grounds of greater precision or range, but it's a
big chunk of code and the public demand for the functionality has not
exactly been overwhelming.
Comments?
Since no one's even bothered to respond, I take it there's insufficient
interest in the numeric versions of these aggregates. I've committed
just the float8 versions.
I added some very trivial regression tests, which we'll have to keep an
eye on to see if they have any portability problems. We may need to
back off the number of displayed fraction digits to get them to pass
everywhere.
If anyone wants to do better tests, feel free...
regards, tom lane
Tom, thank you for reviewing/correcting/applying my patches...
On Fri, 28 Jul 2006, Tom Lane wrote:
I wrote:
There is room to argue that the numeric-arithmetic version would be
worth having on the grounds of greater precision or range, but it's a
big chunk of code and the public demand for the functionality has not
exactly been overwhelming.
Comments?
Since no one's even bothered to respond, I take it there's insufficient
interest in the numeric versions of these aggregates. I've committed
just the float8 versions.
My opinion on that is:
1) I agree that the float4 versions are really useless, because they don't
offer anything new compared to float8.
2) But I think the numeric versions of these functions are not useless
(if somebody has numeric columns...).
I don't think that 10-15 additional functions in pg_proc are too much.
Concerning the amount of code in numeric.c, I think it is large but not
complicated at all, so I don't think that it will be a problem to support
that code.
I think since we are supporting the numeric type as a special
high-precision type, Postgres must have the high-precision
versions of all computational functions. Just my opinion.
I added some very trivial regression tests, which we'll have to keep an
eye on to see if they have any portability problems. We may need to
back off the number of displayed fraction digits to get them to pass
everywhere.
If anyone wants to do better tests, feel free...
I will try to write some better tests and send a patch.
Regards,
Sergey
*******************************************************************
Sergey E. Koposov
Max Planck Institute for Astronomy/Sternberg Astronomical Institute
Tel: +49-6221-528-349
Web: http://lnfm1.sai.msu.ru/~math
E-mail: math@sai.msu.ru
Sergey E. Koposov wrote:
I think since we are supporting the numeric type as a special
high-precision type, Postgres must have the high-precision
versions of all computational functions. Just my opinion.
Another way to look at it is whether you want to have accurate
computations (numeric) or approximate computations (float). I'm not a
statistician, so I don't know what most of these functions are used
for. From a mathematician's point of view, however, some of these
functions normally produce irrational numbers anyway, so it seems
unlikely that numeric will be useful. But looking at the definition
of, say, regr_avgx(Y, X), if all the input values are integers, it
might be useful if I could get an exact integer or rational number as
output, instead of a float, that is.
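That idea can be sketched with a toy example (the pairs and the use of Python's Fraction are illustration only; per the spec, regr_avgx(Y, X) averages X over the rows where both inputs are non-null):

```python
from fractions import Fraction

# Toy data: (y, x) pairs, with None standing in for SQL NULL.
pairs = [(1, 1), (2, 2), (None, 3), (4, 4)]

# regr_avgx(Y, X) averages X over the pairs where both inputs are
# non-null, so the (None, 3) row is skipped.
xs = [x for y, x in pairs if y is not None and x is not None]

exact = Fraction(sum(xs), len(xs))   # 7/3, an exact rational
approx = sum(xs) / len(xs)           # 2.3333333333333335, a float8

print(exact, approx)
```

With integer inputs the exact answer is always a rational like 7/3 here; a float8 result can only approximate it.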
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
On Sat, 29 Jul 2006, Peter Eisentraut wrote:
Sergey E. Koposov wrote:
I think since we are supporting the numeric type as a special
high-precision type, Postgres must have the high-precision
versions of all computational functions. Just my opinion.
Another way to look at it is whether you want to have accurate
computations (numeric) or approximate computations (float). I'm not a
statistician, so I don't know what most of these functions are used
for. From a mathematician's point of view, however, some of these
functions normally produce irrational numbers anyway, so it seems
unlikely that numeric will be useful. But looking at the definition
of, say, regr_avgx(Y, X), if all the input values are integers, it
might be useful if I could get an exact integer or rational number as
output, instead of a float, that is.
From the statistical point of view there is exactly no need for integer
output from those two-argument aggregates. For example, corr() and
regr_*() are by definition not integer (-1 <= corr(Y,X) <= 1); they just
don't make any sense as integers. (Likewise, stddev(int) does not produce
an int either, because that would not make sense.)
So it's perfectly fine that they produce only floating-point numbers...
Regards,
Sergey
Peter Eisentraut <peter_e@gmx.net> writes:
... From a mathematician's point of view, however, some of these
functions normally produce irrational numbers anyway, so it seems
unlikely that numeric will be useful. But looking at the definition
of, say, regr_avgx(Y, X), if all the input values are integers, it
might be useful if I could get an exact integer or rational number as
output, instead of a float, that is.
The question is whether this is useful enough to justify adding a
thousand lines to numeric.c.
I believe also that the numeric forms of the aggregates will be
enormously slower than the float forms, at least on most modern
machines with decent floating-point performance. I don't have time
to do any performance testing though.
regards, tom lane
On Sat, 29 Jul 2006, Tom Lane wrote:
Peter Eisentraut <peter_e@gmx.net> writes:
... From a mathematician's point of view, however, some of these
functions normally produce irrational numbers anyway, so it seems
unlikely that numeric will be useful. But looking at the definition
of, say, regr_avgx(Y, X), if all the input values are integers, it
might be useful if I could get an exact integer or rational number as
output, instead of a float, that is.
The question is whether this is useful enough to justify adding a
thousand lines to numeric.c.
I believe also that the numeric forms of the aggregates will be
enormously slower than the float forms, at least on most modern
machines with decent floating-point performance. I don't have time
to do any performance testing though.
Based on some simple tests on my laptop (1.8 GHz Centrino), the numeric
versions are 5-10 times slower (depending on which two-argument function
is used).
The example:
template1=# \d test
Table "public.test"
Column | Type | Modifiers
--------+------------------+-----------
na | numeric |
nb | numeric |
da | double precision |
db | double precision |
template1=# select count(*) from test;
count
---------
1000000
(1 row)
template1=# select regr_avgx(da,db) from test;
regr_avgx
-----------------
0.5002412120227
(1 row)
Time: 1052,893 ms
template1=# select regr_avgx(na,nb) from test;
regr_avgx
-------------------------
0.500040167263887822939
(1 row)
Time: 4459,739 ms
template1=# select regr_sxx(da,db) from test;
regr_sxx
------------------
83303.6317359119
(1 row)
Time: 1043,891 ms
template1=# select regr_sxx(na,nb) from test;
regr_sxx
--------------------------------------------------
83342.044294954140912267902323775495680113567986
(1 row)
Time: 8514,843 ms
template1=# select corr(da,db) from test;
corr
----------------------
0.000527588261283456
(1 row)
Time: 1074,948 ms
template1=# select corr(na,nb) from test;
corr
----------------------------------------------------------------------------------------
0.000759857150984988517883855238363403977440313567465424735082001422354119457463407737
(1 row)
Time: 18327,376 ms
Yes, that's significantly slower, but I don't think that it makes the
functions unusable...
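For a rough analogue outside the database, one can compare a sum-of-squares transition loop in binary floating point against arbitrary-precision decimal arithmetic (Python's Decimal standing in for numeric, its float for float8; the data sizes are made up and this is only a sketch, not the PostgreSQL code paths):

```python
import time
from decimal import Decimal

# Made-up data: 200k values, as floats and as exact decimals.
xs = [i * 0.001 for i in range(200_000)]
ds = [Decimal(str(x)) for x in xs]

t0 = time.perf_counter()
sf = 0.0
for x in xs:            # float8-style accumulation
    sf += x * x
t1 = time.perf_counter()

sd = Decimal(0)
for d in ds:            # numeric-style accumulation
    sd += d * d
t2 = time.perf_counter()

print(f"float sum-of-squares:   {t1 - t0:.3f}s")
print(f"decimal sum-of-squares: {t2 - t1:.3f}s")
```

The two sums agree closely; the decimal loop just pays for every digit of precision, which matches the shape of the slowdown in the psql timings above.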
Regards,
Sergey